On-Chip Communication Network: User Manual V1.0.1, 14 October 2003
OCCN
On-Chip Communication Architecture
Abstract
The On-Chip Communication Network (OCCN) project provides an efficient, open-
source, GNU-GPL licensed framework, developed within SourceForge for the
specification, modeling, simulation, and design exploration of network on-chip (NoC)
based on an object-oriented C++ library built on top of SystemC. OCCN is shaped by
our experience in developing communication architectures for different System-on-Chip
(SoC). OCCN increases the productivity of developing communication driver models
through the definition of a universal communication API. This API provides a new design
pattern that enables creation and reuse of executable transaction level models (TLMs)
across a variety of SystemC-based environments and simulation platforms. It also
addresses model portability, simulation platform independence, interoperability, and
high-level performance modeling issues.
1. Introduction
Due to steady downscaling of CMOS device dimensions, manufacturers are increasing
the amount of functionality on a single chip. It is expected that by the year 2005, complex
systems, called Multiprocessor System-on-Chip (MPSoC), will contain billions of
transistors. The canonical MPSoC view consists of a number of processing elements
(PEs) and storage elements (SEs) connected by a complex communication architecture.
PEs implement one or more functions using programmable components, including
general purpose processors and specialized cores, such as digital signal processor (DSP)
and VLIW cores, as well as embedded hardware, such as FPGA or application-specific
intellectual property (IP), analog front-end, peripheral devices, and breakthrough
technologies, such as micro-electro-mechanical structures (MEMS) [18] and
micro-electro-fluidic bio-chips (MEFS) [61].
Figure 1. MPSoC configured with on-chip communication architecture, processing, and storage elements
The SPIN NoC, proposed by the LIP6 laboratory of the University of Pierre and
Marie Curie, uses packet switching with wormhole routing and input queuing in a
fat-tree topology. It is a scalable
network for data transport, but uses a bus network for control. It is a best-effort network,
optimized for average performance, e.g. by the use of optimistic flow control coupled
with deflection routing. Packet delivery is guaranteed, but latency bounds are
only given statistically. Moreover, input queuing causes head-of-line blocking,
which limits the ability to provide latency guarantees on the data network.
The Raw network tries to implement a simple, highly parallel VLSI architecture by fully
exposing low-level details of the hardware to the compiler, so that the compiler (or the
software) can determine and implement the best allocation of resources, including
scheduling, communication, computation, and synchronization, for each possible
application. Raw implements fine-grain communication between local, replicated
processing elements and, thus, is able to exploit parallelism in data parallel applications,
such as multimedia processing.
The choice of OCCA is critical to the performance and scalability of an
MPSoC1. An OCCA design for a network processor, such as MIT's Raw network
on-chip, will have different communication semantics from an OCCA design for a
multimedia MPSoC. Furthermore, to achieve OCCA scalability cost-effectively, we
must consider various architectural, algorithmic, and physical constraint
issues arising from technology [37].
Thus, within OCCA modeling we must consider architecture realizability and
serviceability. Although efficient programmability is also important, it relates to high-
level communication and synchronization libraries, as well as system and application
software issues that fall outside of the OCCA scope [27].
1 SoC performance varies by up to 250% depending on the OCCA, and up to 600% depending on communication traffic [3].
The new nanometer technologies provide very high integration capabilities,
allowing the implementation of very complex systems with several billion
transistors on a single chip. However, two main challenges should be addressed.
• How to handle escalating design complexity and time-to-market pressures for
complex systems, including partitioning into interconnecting blocks,
hardware/software partitioning of system functionality, interconnect design with
associated delays, synchronization between signals, and data routing.
• How to solve issues related to the technologies themselves, such as cross-talk
between wires, the increased impact of parasitic capacitance and resistance on
the global behavior of the system, voltage swing, leakage current, and power consumption.
Future NoC systems will inevitably generate errors, and their reliability
should be considered from the system-level design phase [20]. In a complex NoC,
the probability that some element fails is non-negligible, so transient,
intermittent, and permanent hardware and software errors, especially in corner
situations, may occur at any time. Thus, we characterize NoC serviceability
with corresponding reliability, availability, and performability metrics.
• Reliability refers to the probability that the system is operational during a specific
time interval. Reliability is important for mission-critical and real-time systems, since
it assumes that system repair is impossible. Thus, reliability refers to the system's
ability to support a certain quality of service (QoS), i.e. latency, throughput, power
consumption, and packet loss requirements in a specified operational environment.
Notice that QoS must often take into account future traffic requirements, e.g. arising
from multimedia applications, scaling of existing applications, and network evolution,
as well as cost vs. productivity gain issues.
• System dependability and maintainability models analyze transient, intermittent, and
permanent hardware and software faults. While permanent faults cause an irreversible
system fault, some faults last for a short period of time, e.g. nonrecurring transient
faults and recurring intermittent faults. When repairs are feasible, fault recovery is
usually based on detection (through checkpoints and diagnostics), isolation, rollback,
and reconfiguration. Then, we define the availability metric as the average fraction of
time that the system is operational within a specific time interval.
• While reliability, availability and fault-recovery are based on two-state component
characterization (faulty, or good), system performability measures degraded system
operation in the presence of faults, e.g. increased congestion, packet latency, and
distance to destination when there is no loss (or limited loss) of system connectivity.
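The availability metric above can be made concrete with a short sketch. This is our own formulation for illustration (the function name and interval representation are assumptions, not OCCN code): availability is the summed operational time divided by the length of the observation interval.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch: availability as the average fraction of a time interval
// during which the system is operational. Each pair holds the start
// and end of one operational (up) interval.
double availability(const std::vector<std::pair<double, double>>& up_intervals,
                    double total_time) {
    double up = 0.0;
    for (const auto& iv : up_intervals)
        up += iv.second - iv.first;   // sum operational time
    return up / total_time;           // fraction of the interval
}
```

For example, a system that is up during [0, 50] and [60, 100] over a 100-unit interval has availability 0.9.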
The rapid evolution of Electronic System Level (ESL) methodology addresses MPSoC
design. ESL focuses on the functionality and relationships of the primary system
components, separating system design from implementation. Low-level implementation
issues greatly increase the number of parameters and constraints in the design space, thus
extremely complicating optimal design selection and verification efforts. Similar to near-
optimal combinatorial algorithms, e.g. travelling salesman heuristics, ESL models
effectively prune away poor design choices by identifying bottlenecks, and focus on
closely examining feasible options. Thus, for the design of MPSoC, OCCA (or
NoC) design space exploration based on analytical modeling and simulation,
rather than actual prototyping, is essential.
Notice that a virtual SoC prototype may hide, modify, or omit SoC properties. As shown in
Figure 2, abstraction levels span multiple levels of accuracy, ranging from functional- to
transistor-level. Each level introduces new model details [30]. We now describe
abstraction levels, starting with the most abstract and going to the most specific.
Functional models usually have no notion of resource sharing or time. Thus,
functionality is executed instantaneously, or as an ordered sequence of events
as in a functional TCP model, and the model may or may not be bit-accurate.
This layer is suitable for system concept validation and functional
partitioning between control and data.
Register-transfer level models (RTL) correspond to the abstraction level from which
synthesis tools can generate gate-level descriptions (or netlists). RTL systems are usually
visualized as having two components: data and control. The data part is composed of
registers, operators, and data paths. The control part provides the time sequence of signals
that evoke activities in the data part. Data types are bit-accurate, interfaces are pin-
accurate, and register transfer is accurate. Propagation delay is usually back annotated
from gate models.
Gate models are described in terms of primitives, such as logic with timing data and
layout configuration. For simulation reasons, gate models may be internally mapped to a
continuous time domain, including currents, voltages, noise, clock rise and fall times.
Storage and operators are broken down into logic implementing the corresponding digital
functions, while timing for individual signal paths can be obtained.
maintenance costs. Furthermore, notice that in the VHDL model description, it is not
possible to distinguish code used specifically for SRAM control from code referring
solely to the data path [57, 61].
/* sram_func.c: core functionality written in C */
#define SRAM_SIZE 128

// Sram.cpp file
sram::sram(sc_module_name name) : sc_module(name) // SystemC wrapper
{ SC_THREAD(behavior); }

void sram::behavior()
{
  struct {
    int32 data, addr;
    bool rnw;
  } d;
  while (true) {
    in->read(d);
    if (d.rnw) out->write(sram_read(d.addr));
    else sram_write(d.addr, d.data);
  }
}
In order to provide system performance measurements, e.g. throughput rates, packet loss,
or latency statistics, system computation (behavior) and communication models must be
annotated with an abstract notion of time. Analysis using parameter sweeps helps
estimate the sensitivity of high-level design due to perturbations in the architecture, and
thus examine the possibility of adding new features in derivative products [8, 9].
However, architectural delays are not always cycle-accurate. For example, for
computational components mapped to a particular RTOS or large communication
transactions mapped to a particular shared bus, it is difficult to accurately estimate thread
delays, which depend on precise system configuration and load. Similarly, for
deep sub-micron technology, wire delays that dominate protocol timings cannot
be determined until layout time. Thus, one must include all necessary synchronization points and/or
interface logic in order to avoid deadlocks or data races and ensure correct behavior
independent of computation or communication delays.
All these abstract models can be analyzed, verified, and validated in a stand-alone manner.
However, detailed VHDL or Verilog models are inadequate for system level description
due to poor simulation performance, and a higher abstraction level is desirable. Thus,
when high-level views and increased simulation speed are desirable, systems may be
modeled at the transaction level using an appropriate abstract data type (ADT), e.g.
processor core, video line, or network communication protocol-specific data structure.
With transactional modeling, the designer is able to focus on IP functionality, rather than
on detailed data flows or hardware details of the physical interface, e.g. FIFO sizes and
time constraints.
Communication Refinement
above it. Thus, an upper layer always depends on the lower layer, but never the other way
round. An advantage of layering is that the method of passing information between layers
is well specified, and thus changes within a protocol layer are prevented from affecting
lower layers. This increases productivity, and simplifies design and maintenance of
communication systems.
• The packet layer breaks operations into one or more request/response packet pairs
depending on the complexity of the protocol, the amount of data to be transferred, and
the width of the interface. Each packet carries a fixed quantum of data directly
mapped onto the interconnection architecture. Thus, for STBus, the amount of
information transferred in the request phase is either the same as (type 1/type
2 interfaces), or differs from (type 3 interface), that transferred in the response phase.
• The cell layer breaks these packets into a series of cells, each cell having the right
width to match the packet route through the on-chip communication architecture.
• Finally, the physical layer is responsible for physical encoding of these cells, adding
outbound framing and flow control information. Simple protocol verification may be
performed for each layer in the hierarchy.
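The cell layer's job can be sketched in a few lines. The following is a self-contained illustration, not OCCN code: the function name and padding policy are our assumptions. A packet payload is split into cells of the width supported by the interconnect, with the final cell padded to full width.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch of the cell layer: split a packet payload into
// fixed-width cells; the last cell is zero-padded if needed.
std::vector<std::string> split_into_cells(const std::string& payload,
                                          std::size_t cell_width) {
    std::vector<std::string> cells;
    for (std::size_t i = 0; i < payload.size(); i += cell_width) {
        std::string cell = payload.substr(i, cell_width);
        cell.resize(cell_width, '\0');   // pad the final cell to full width
        cells.push_back(cell);
    }
    return cells;
}
```

An 8-byte payload split into 4-byte cells yields two cells; a 5-byte payload also yields two, the second mostly padding.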
Access to a channel is provided through specialized ports (small red squares in Figure 6).
For example, for the standard sc_fifo channel two specializations are provided:
sc_fifo_in<T> and sc_fifo_out<T>. They allow FIFO ports to be read and written
without accessing the interface methods. Hereafter, they are referred to as Port API.
void produce() {
  const char *str = "hello world!";
  while (*str) { out.write(*str++); } // call API of "out"
}
Like OSI layering, the OCCN methodology for NoC establishes a conceptual model
for inter-module communication based on layering, with each layer translating
transaction requests to a lower-level communication protocol. As shown in
Figure 7, the OCCN methodology defines three distinct layers. The lowest layer
provided by OCCN, called the NoC communication layer, implements one or more
consecutive OSI layers, starting from the physical layer. For example, the STBus NoC
communication layer abstracts the physical and data link layers. On top of the OCCN
protocol stack, the user-defined application layer maps directly to the application layer of
the OSI stack. Sandwiched between the application and NoC communication layers lies
the adaptation layer that maps to one or more middle layers of the OSI protocol stack,
including software and hardware adaptation components. The aim of this layer is to
provide, through efficient, inter-dependent entities called communication drivers, the
necessary computation, communication, and synchronization library functions and
services that allow the application to run. Although the adaptation layer is
usually user-defined, it utilizes functions defined within the OCCN communication API.
Figure 7. The OCCN protocol stack: the application layer (software adaptation)
accesses the adaptation layer through the application API, and the adaptation
layer (hardware adaptation) accesses the NoC communication layer through the
communication API.
control, data or message management, error handling, and various support services to the
application software.
The fundamental components of the OCCN API are the Protocol Data Unit (Pdu), the
MasterPort and SlavePort interface, and high-level system performance modeling. These
components are described in the following sections.
or jobs in a queuing network. Thus, Pdus are a fundamental ingredient for implementing
inter-module (or inter-PE) communication using arbitrarily complex data structures.
A Pdu is essentially the smallest part of a message that can be independently
routed through the network. Messages can be variable in length, consisting of
several Pdus. Each Pdu usually consists of several fields.
• The header field (sometimes called protocol control information, or PCI) provides
the destination address(es), and sometimes includes source address. For variable size
Pdus, it is convenient to represent the data length field first in the header field. In
addition, routing path selection, or Pdu priority information may be included.
Moreover, the header provides an operation code that distinguishes: (a) request from
reply Pdus, (b) read, write, or synchronization instructions, (c) blocking, or
nonblocking instructions, and (d) normal execution from system setup, or system test
instructions. Sometimes performance related information is included, such as a
transaction identity/type, and epoch counters. Special flags are also needed for
synchronizing accesses to local communication buffers (which may wait for network
data), and for distinguishing buffer pools, e.g. for pipelining sequences of
nonblocking operations. In addition, if Pdus do not reach their destinations in their
original issue order, a sequence number may be provided for appropriate Pdu
reordering. Furthermore, for efficiency reasons, we will assume that the following
two fields are included with the Pdu header.
Ø The checksum (CRC) protects header information (and sometimes data),
enabling error detection or correction.
Ø The trailer, consisting of a Pdu termination flag, is used as an alternative
to a Pdu length sub-field for variable-size Pdus.
• The data field (called payload, or service data unit, or SDU) is a sequence of bits that
are usually meaningless for the channel. A notable exception is when data reduction
is performed within a combining, counting, or load balancing network.
Basic Pdus in simple point-to-point channels may contain only data. For complicated
network protocols, Pdus must use more fields, as explained below.
• Remote read or DMA includes header, memory address, and CRC.
• Reply to remote read or DMA includes header, data, and CRC.
• Remote write includes header, memory address, data, and CRC.
• Reply from remote write includes header and CRC.
• Synchronization (fetch&add, compare&swap, and other read-modify-write
operations) includes header, address, data, and CRC.
• Reply from synchronization includes header, data, and CRC.
• Performance-related instructions, e.g. remote enqueue may include various fields to
access concurrent or distributed data structures.
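The "remote write" field list above can be pictured as a concrete struct. The field and struct names below are hypothetical (they are ours, not OCCN's); the sketch just shows a plausible layout with header, memory address, data, and CRC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for the remote-write Pdu described above.
struct RemoteWriteHeader {
    uint8_t opcode;   // distinguishes read/write/sync, request/reply
    uint8_t seq;      // sequence number for Pdu reordering
};

struct RemoteWritePdu {
    RemoteWriteHeader hdr;
    uint32_t addr;    // target memory address
    uint32_t data;    // payload word to write
    uint16_t crc;     // checksum over the header (and possibly data)
};
```

A reply to this Pdu would carry only a header and CRC, per the list above.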
Furthermore, within the OCCN channel, several important routing issues
involving Pdus must be explored (see Section 1). Thus, OCCN defines various
functions that support simple and efficient interface modeling, such as
adding/stripping headers from Pdus,
copying Pdus, error recovery, e.g. checkpoint and go-back-n procedures, flow control,
segmentation and re-assembly procedures for adapting to physical link bandwidth,
service access point selection, and connection management. Furthermore, the Pdu
specifies the format of the header and data fields, the way that bit patterns must be
interpreted, and any processing to be performed (usually on stored control information) at
the sink, source or intermediate network nodes.
The Pdu class provides modeling support for the header, data field and trailer as
illustrated in the following C++ code block.
template <class H, class BU, int size>
class Pdu {
public:
  H hdr;          // header (or PCI)
  BU body[size];  // data (or SDU)
};
Depending on the circumstances, OCCN Pdus are created using four different
methods. HeaderType (H) is always a user-defined C++ struct, while BodyUnitType
(BU) is either a basic data type, e.g. char and int, or an encapsulated Pdu;
the latter case is useful for defining layered communication protocols.
• Define a simple Pdu containing a body of length elements of BodyUnitType:
Pdu<BodyUnitType, length> pk2;
• Define a simple Pdu containing only a body of BodyUnitType:
Pdu<BodyUnitType> pk1;
• Define a Pdu containing a header and a body of BodyUnitType:
Pdu<HeaderType, BodyUnitType> pk3;
• Define a Pdu containing a header and a body of length elements of
BodyUnitType:
Pdu<HeaderType, BodyUnitType, length> pk4;
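The four declaration forms can be exercised with a minimal stand-alone analogue of the Pdu template. This is a sketch, not the real OCCN class: we use a default size of 1, and an empty struct stands in for the header-less forms that OCCN supports directly.

```cpp
#include <cassert>

// Minimal analogue of the OCCN Pdu template (illustration only):
// header type H, body-unit type BU, and body length size.
template <class H, class BU, int size = 1>
class Pdu {
public:
    H  hdr;          // header (PCI)
    BU body[size];   // data (SDU)
};

struct MyHeader { int dest; };   // a user-defined header struct
struct NoHeader {};              // stand-in for "no header"

Pdu<NoHeader, char>    pk1;      // body of a single char
Pdu<NoHeader, char, 8> pk2;      // body of 8 chars
Pdu<MyHeader, char>    pk3;      // header + single-char body
Pdu<MyHeader, char, 8> pk4;      // header + 8-char body
```

Header fields and body elements are then accessed as ordinary members.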
Processes access Pdu data and control fields using the following functions.
• The occn_hdr(pk, field_name) function is used to read or write the Pdu header.
• The standard operator “=” is used to
Ø read or write the Pdu body,
Ø copy Pdus of the same type.
• The operators ">>" and "<<" are used to
Ø send or receive Pdus from input/output streams, and
Ø segment and re-assemble Pdus.
Similarly, we can define a Pdu containing a single-character body. Pdus may be
duplicated as follows; this operation is useful for sending a Pdu while keeping
a local copy.
Pdu<char> pk1, pk2;
pk2=pk1;
As shown in Figure 8, layered protocols may also encapsulate a Pdu from another layer.
The following example shows all operations that an N-layer protocol has to perform in
order to build an N-layer Pdu, and subsequently an N+1-layer Pdu.
typedef Pdu<n1_pci,char> N1pdu; // declare a Pdu for layer N+1
N1pdu p1, p2;
typedef Pdu<n_pci,N1pdu> Npdu;  // declare a Pdu for layer N
Npdu p3, p4;
p1 = 'a'; // p1 contains 'a'
p3 = p1;  // p3 contains p1 (add operation)
p4 = p3;  // p4 is equal to p3 (copy operation)
p2 = p4;  // p2 assumes the value of the body (strip operation)
char tmp = p1;
// since both the body of p1 and tmp are equal to 'a',
// the cout statement is executed
if ((p1 == 'a') && (tmp == 'a'))
  cout << "p1 == 'a'" << endl;
Figure 9. A simple token DS LINK (IEEE 1355): a data token carries a parity bit
(P), a control bit (T), and 8 data bits; a control token carries P, T, and 2
type bits.
As an example, let us consider the high-speed digital serial links known as IEEE1355.
IEEE 1355 specifies the physical media and low-level protocols for a family of serial
scalable interconnect systems. Pdus defined in the character layer are called tokens. They
include a parity bit (P) plus a control bit (T) used to distinguish between data and control
tokens. In addition, a data token contains 8 bits of data, and control tokens contain two
bits to denote the token type. This is illustrated in Figure 9. The OCCN modeling
structure is as follows.
struct DSLINK_token {
  uint P :1;
  uint T :1;
};
Pdu<DSLINK_token, char> pk1, pk2;
occn_hdr(pk1,P) = 1; // parity field in pk1 is set equal to 1
occn_hdr(pk1,T) = 0; // we set pk1 as data token, because T=0
uint tmp = occn_hdr(pk1,P); // tmp is set equal to 1
char body = 'a';
pk1 = body; // pk1 contains 'a'
pk2 = pk1;  // now pk2 and pk1 are the same, since "=" copies Pdus
char x = pk2; // x assumes the value 'a'
// since pk1 is equal to pk2 and x is equal to ‘a’,
// the cout statement is executed
if ((pk1 == pk2) && (x == ’a’))
cout << "pk1 == pk2" << endl;
Figure 10. CSMA/CD frame format:
  Header:  Preamble (7 bytes), SFD (1 byte), Dest Address (6 bytes),
           Source Address (6 bytes), Length (2 bytes)
  Body:    Data (up to 1500 bytes)
  Trailer: FCS (4 bytes)
Alternatively, as illustrated in Figure 10, a complex Pdu containing both header and body
for a local area network CSMA/CD may be defined as follows.
struct CSMA_CD_ctrl {
  int8 preamble[7];
  int8 sfd;
  int8 dest[6];
  int8 source[6];
  int16 length;
  int32 FCS;
};
// first argument is the type, while the last one is the size of the body
Pdu<CSMA_CD_ctrl, char, 1500> pk1, pk2;
occn_hdr(pk1,sfd) = 3; // set sfd field in pk1 to 3
int8 tmp = occn_hdr(pk1,sfd); // set tmp to 3
char body[1500] = "I am the body of a Pdu";
pk1 = body; // pk1 contains "I am the body of a Pdu" (all data copied)
pk2 = "I am an OCCN Pdu"; // pk2 contains "I am an OCCN Pdu"
Segmentation and re-assembly are very common operations in network protocols. Thus,
OCCN supports the overloaded operators “>>”, “<<”. The following is a full
segmentation/ reassembly example.
typedef struct { int32 seq; } Headermsg;
Pdu<Headermsg, char, 4> pk0, pk1, pk2, pk3;
Pdu<Headermsg, char, 8> msg1, msg2;
occn_hdr(msg1, seq) = 13;
msg1 = "abcdefgh"; // msg1 contains "abcdefgh"
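The idea behind the overloaded segmentation and re-assembly operators can be shown with a self-contained toy. The operator semantics below are our own assumption for illustration and do not reproduce the exact OCCN implementation: ">>" moves the next fixed-size slice of a message into a packet, and "<<" appends a packet body back onto a message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy sketch of segmentation/re-assembly (not OCCN code).
struct Message {
    std::string data;
    std::size_t offset = 0;   // how much has been segmented so far
};

struct Packet {
    std::string data;
};

// Segment: move the next 4 characters of the message into the packet.
Message& operator>>(Message& m, Packet& p) {
    p.data = m.data.substr(m.offset, 4);
    m.offset += 4;
    return m;
}

// Re-assemble: append the packet body back onto a message.
Message& operator<<(Message& m, const Packet& p) {
    m.data += p.data;
    return m;
}
```

With this sketch, an 8-character message segments into two 4-character packets, and concatenating the packets in order reconstructs the message.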
The Pdu is a key element in the OCCN API, with its structure being determined by the
corresponding OCCA model. In this Section we describe the transmission/reception
interface of the OCCN API. The paradigm used for this interface is message-passing,
with send and receive primitives for point-to-point and multi-point communication. If the
Pdu structure is determined according to a specific OCCA model, the same base
functions are required for transmitting the Pdu through almost any OCCA. Great
effort has been dedicated to defining this interface as a reduced subset of
functions that provides users with complete and powerful semantics. In this
manner, we can achieve model reuse and inter-module communication protocol
refinement through a generic OCCA access interface. Thus, the same base
functions are used for transmitting a Pdu for almost any OCCA, e.g.
AMBA, STBus, Sonics, or CoreConnect.
For message passing, there are two major point-to-point send/receive primitives:
synchronous and asynchronous. Synchronous primitives are based on acknowledgments,
while asynchronous primitives usually deposit and remove messages to/from application
and system buffers. Within the class of asynchronous point-to-point communications,
there are also two other major principles: blocking and non-blocking. While non-blocking
operations allow the calling process to continue execution, blocking operations suspend
execution until receiving an acknowledgment or timeout. Although we often
define various buffer-based optimization-specific principles, e.g. the
standard, buffered, and ready send/receive in the MPI standard, we currently
focus only on the major send/receive principles.
Synchronous blocking send/receive primitives offer the simplest semantics for the
programmer, since they involve a handshake (rendezvous) between sender and receiver.
• A synchronous send busy waits (or suspends temporarily) until a matching receive is
posted and receive operation has started. Thus, the completion of a synchronous send
guarantees (barring hardware errors) that the message has been successfully received,
and that all associated application data structures and buffers can be reused. A
synchronous send is usually implemented in three steps.
Ø First, the sender sends a request-to-send message.
Ø Then, the receiver stores this request.
Ø Finally, when a matching receive is posted, the receiver sends back a permission-
to-send message, so that the sender may send the packet.
• Similarly, a synchronous receive primitive busy waits (or suspends temporarily) until
there is a message to read.
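The rendezvous semantics described above can be approximated with ordinary C++ threads. This is a sketch using std::condition_variable in place of SystemC processes; the class and member names are ours, not OCCN's. send() blocks until a matching receive() has consumed the value, mirroring the request-to-send / permission-to-send handshake.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch of a synchronous (rendezvous) channel for a single
// sender/receiver pair.
class RendezvousChannel {
    std::mutex m;
    std::condition_variable cv;
    int  slot = 0;
    bool full = false;    // a message is waiting
    bool taken = false;   // the receiver has consumed it
public:
    void send(int v) {
        std::unique_lock<std::mutex> lk(m);
        slot = v;
        full = true;
        cv.notify_all();                        // announce the message
        cv.wait(lk, [this] { return taken; });  // block until received
        full = false;
        taken = false;                          // reset for the next pair
    }
    int receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return full; });   // block until a sender posts
        int v = slot;
        taken = true;
        cv.notify_all();                        // release the sender
        return v;
    }
};
```

When send() returns, the message is guaranteed to have been received, matching the completion guarantee of the synchronous send above.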
With asynchronous blocking operations we avoid polling, since we know exactly when
the message is sent/received. Furthermore, in a multi-threaded environment, a blocking
operation blocks only the executing thread, allowing the thread scheduler to re-schedule
another thread for execution, thus resulting in performance improvement. The
communication semantics for point-to-point asynchronous blocking primitives are
defined as follows.
• The blocking send busy waits (or suspends temporarily) until the packet is safely
stored in the receive buffer (if the matching receive has already been posted), or in a
temporary system buffer (message in care of the system). Thus, the sender may
overwrite the source data structure or application buffer after the blocking send
operation returns. Compared to a synchronous send, this allows the sending process to
resume sooner, but the return of control does not guarantee that the message will
actually be delivered to the appropriate process. Obtaining such a guarantee would
require additional handshaking.
• The blocking receive busy waits (or suspends temporarily) until the requested
message is available in the application buffer. Only after the message is received, the
next receiver instruction is executed. Unlike a synchronous receive, a blocking
receive does not send an acknowledgment to the sender.
• A non-blocking send initiates the send operation, but does not complete it. The send
returns control to the user process before the message is copied out of the send buffer.
Thus, data transfer out of the sender memory may proceed concurrently with
computations performed by the sender after the send is initiated and before it is
completed. A separate send completion function, implemented by accessing (probing)
a system communication object via a handle, is needed to complete the
communication, i.e. for the user to check that the data has been copied out of the send
buffer, so that the application data structures and buffers may be reused2. These
functions either block until the desired state is observed, or return control
immediately reporting the current send status.
• Similarly, a non-blocking receive initiates the receive operation, but does not
complete it. The call will return before a message is stored at the receive buffer. Thus,
data transfer into receiver memory may proceed concurrently with computations
performed after receive is initiated and before it is completed. A separate receive
completion function, implemented by accessing (probing) a system communication
object via a handle, is needed to complete the receive operation, i.e. for the user to
verify that data has been copied into the application buffer2. These probes either block
until the desired state is observed, or return control immediately reporting the current
receive status.
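The handle-plus-completion-probe pattern described above can be sketched with std::future standing in for the system communication object. The function name and payload handling are our assumptions, not the OCCN API: the caller regains control immediately and must not reuse the send buffer until the handle reports completion.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Sketch of a non-blocking send: returns a handle; probing the handle
// (here, future::get or future::wait_for) completes the communication.
std::future<bool> asend_sketch(std::string payload) {
    return std::async(std::launch::async, [payload] {
        // stand-in for the actual data transfer out of the send buffer
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        return true;   // transfer completed
    });
}
```

A blocking probe (future::get) waits until the desired state is observed; future::wait_for with a zero timeout would instead return immediately with the current status, matching the two probe styles described above.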
The precise type of send/receive semantics to implement depends on how the program
uses its data structures, and how much we want to optimize performance over ease of
programming and portability to systems with different semantics. For example,
asynchronous sends alleviate the deadlock problem due to missing receives, since
processes may proceed past the send to the receive process. However, for non-blocking
asynchronous receive, we need to use a probe before actually using the received data.
2 A blocking send (receive) is equivalent to a non-blocking send (resp., receive) immediately followed by a blocking send (resp., receive) completion function call.
process may return later to check the status of the dispatched Pdu, e.g. by establishing
a handshake.
from reply ensures that the channel send/receive operation is completed and that
the receiver is synchronized with the sender. The following code is used for
receiving a Pdu.
sc_time timeout = ….;
bool received;
// Suppose that in is an OCCN SlavePort
Pdu<…> *msg = in.receive(timeout, received);
if (!received) {
  // timeout expired: received Pdu not valid
} else {
  // received Pdu is valid; user may perform elaboration on Pdu
  reply(); // synchronizing sender & receiver after 0 bus cycles
}
Notice that when the delay of a transaction is measured in terms of bus cycles, OCCN
assumes that the channel is the only one to have knowledge of the clock, allowing
asynchronous processes to be connected to synchronous clocked communication media.
In both cases the latency of reply can be fixed or dynamically calculated after the
receive, e.g. as a function of the received Pdu.
Furthermore, notice that a missing reply to a synchronous send could cause a deadlock,
unless a sender timeout value is provided. In the latter case, we allow that the Pdu
associated with the missing acknowledgment is lost. Notice that a received Pdu is also
lost, if it is not accessed before the corresponding reply.
Sometimes tasks may need to check, enable, or disable Pdu transmission or reception, or
extract the exact time(s) that a particular communication message arrived. These
functions enable optimized modeling paradigms that are accomplished using appropriate
OCCN channel setup and control functions. A detailed description of these functions falls
outside the scope of this document [15].
Another OCCN feature is protocol in-lining, i.e. the low-level protocol necessary to
interface a specific OCCA is automatically generated using the standard template feature
available in C++ enabled by user-defined data structures. This implies that the user does
not have to write low-level communication protocols already provided by OCCN, thus,
making instantiation and debugging easier. Savings are significant, since in today’s
MPSoC there are 20 or more ports, and 60 to 100 signals per port.
Figure. Communication delay for a Pdu sent from Module A to Module B: sender
overhead, transmission delay (size ÷ bandwidth), and transport delay
(propagation and congestion delay) together make up the total communication
delay, ending with receiver overhead.
An asynchronous blocking send is similar, except that a Pdu acknowledgment does
not exist, and thus asend returns sooner.
In Figure 14, we show state changes during OCCN synchronous and asynchronous
point-to-point communications. A process which calls send() is "Send-blocked"
until its message is received. A process which calls receive() is
"Receive-blocked" until it receives a message. A process which calls asend() is
never blocked, unless it tries to send while the channel is already
transmitting. A process which calls areceive() is "Receive-blocked" until it
receives a message.
[Figure body: state diagram showing Ready and Blocked states for this process and the
other process, with asend() and receive() transitions.]
Figure 14. OCCN protocol send/receive transaction
Using the above send/receive functions and appropriate channel setup and control
functions, any kind of on-chip communication protocol can be modeled. For instance, we
can easily model a Request Grant Valid (RGV) protocol using the synchronous blocking
paradigm for the request grant part and the asynchronous blocking scheme for the Valid
part. RGV is a complex protocol shipping today in millions of SoC devices where high
bandwidth peripherals (like MPEG decode hardware) share the system bus with a host
CPU and many other peripherals of varying performance and system demands.
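As a sketch only (the thread structure, port names, and Pdu types below are hypothetical, written in the style of the manual's later producer/consumer example, not taken from the OCCN sources), the request/grant handshake maps onto the synchronous blocking send(), while the valid phase maps onto asend():

```
// Hypothetical RGV initiator thread (illustrative sketch, not OCCN library code)
void rgv_master::action() {
    Pdu<Request>* req = new(&req_port) Pdu<Request>;
    req_port.send(req);    // Request/Grant: synchronous blocking send,
                           // returns only when the grant (reply) arrives
    Pdu<Data>* val = new(&data_port) Pdu<Data>;
    data_port.asend(val);  // Valid: asynchronous blocking send, returns
                           // as soon as the channel accepts the Pdu
}
```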
System-level modeling using SystemC is an essential ingredient of SoC design flow [53].
Data and control flow abstraction of the system hardware and software components
express not only functionality, but also performance characteristics that are necessary to
identify system bottlenecks. While for software components it is usually the
responsibility of the user to provide appropriate performance measurements, for hardware
components and interfaces it is necessary to provide a statistical package that hides
internal access to the modeling objects. This statistical package can be used for
evaluating SoC performance characteristics. The statistical data may be analyzed online
using visualization software, e.g. the open source Grace tool, or dumped to a file for
subsequent data processing, e.g. via an electronic spreadsheet or a specialized text editor.
For SoC modeling, we usually monitor simple dynamic performance characteristics, e.g.
latency, throughput, and possibly power consumption (switching activity) for obtaining
information on the effectiveness of intra-module computation, and inter-module
communication and synchronization components. Although similar metrics for intra-
module communication and synchronization objects can be defined, these components do
not correspond to unique hardware components, since they may be implemented in
various ways.
Furthermore, for 2D graphs, the statistical API can be based on a possibly overloaded
enable_stat() function that specifies the absolute start and end time for statistics
collection, the title and legends for the x and y axes, the time window for window
statistics, i.e. the number of consecutive points averaged in order to generate a single
statistical point, and the unique (over all modules) object name. Notice that for a
particular statistic, e.g. for read throughput, the basic graph layout (including the title and
x, y axes) is similar, with the legends differentiating only across different objects. Thus,
in order to distinguish the corresponding statistical graphs, the user must provide distinct
names to all modeling objects, even if they exist within different modules.
Since complex systems involve both time-driven (instant) and event-driven (duration)
statistics, we may provide two general monitoring classes, collecting instant and duration
measurements from system components with the following functionality.
• In time-driven simulation, signals usually have instantaneous values. During
simulation, these values are recorded by calling a library-provided public member
function called stat_write.
• In event-driven simulation, recorded statistics for events must include arrival and
departure time (or duration). Since the departure time is known at some point later in
time, the interface can be based on two public member functions.
Ø First a stat_event_start function call records the arrival time, and saves in a
local variable the unique location of the event within the internal table of values.
Ø Then, when the event’s departure time is known, this time is recorded within the
internal table of values at the correct location by calling the stat_event_end
function with the appropriate departure time.
Since the above method enables simple and precise measurement, it is useful for
obtaining system performance metrics. The statistics collection based on stat_write
and stat_event_start/end operations may either be performed by the user, or
directly by a specialized on-chip model, e.g. APB, AHB, CoreConnect, and StBus, using
library-internal object pointers. In the latter case, software probes are inserted into the
source code of library routines, either manually by setting sensors and actuators, or more
efficiently through the use of a monitoring segment which automatically compiles the
necessary probes. Such software probes share resources with the system model, thus
offering low cost, simplicity, flexibility, portability, and precise application
performance measurement in a timely, frictionless manner.
Furthermore, observe that from the above classes representing time- and event-driven
statistics, we may derive specialized classes on which average and instant throughput,
latency, average and instant size, packet loss, and average hit ratio statistics can be based.
Thus, for example, the event-driven statistic class allows the derivation of simple
duration statistics based on an enable_stat_delay library function for Register, FIFO,
memory, and cache objects.
In addition to the previously described classes that cover all basic cases, it is sometimes
necessary to combine statistical data from different modeling objects, e.g. for comparing
average read vs. write access times in a memory hierarchy, or for computing cell loss in
ATM layer communications. For this reason, we need new joint or merged statistic
classes that inherit from time- and event-driven statistics. Special parameters, e.g.
boolean flags, for these joint statistic classes can enable more detailed statistics.
We now describe our general methodology for collecting statistics from system
components.
Figure 15 illustrates the software components within the OCCN statistical package that
include the following C++ statistical classes:
• top level, general classes, such as BaseStat and GuiStat (now implemented as a
set of visualization functions),
• publicly available generic classes, such as StatInstant and StatDuration,
• derived time-driven and event-driven classes, such as StatInstantAvg, and
StatInstantCAvg,
• application-specific functions enabling certain performance metrics, including
StatThroughput, StatDelay, StatSize (and StatInstantSize), StatProb
(and StatInstantProb) classes.
The gui_stat class is gui-specific, i.e. for the current OCCN version it is targeted to the
Grace™ environment. This class includes substantial details regarding the parameters for
the graph layout, e.g. title, axis, ticks, legends and viewport, and also the Grace™ socket
interface. However, for the purpose of this document, it is sufficient to say that this class
includes the necessary routines for visualizing both time-driven and event-driven
statistics.
deviation; notice that sum is useful, e.g. in computing the total number of packets
received or transmitted by a module,
• a uniform time window referring to cumulative statistics.
• user-defined or possibly customized GUI labels, e.g. for 2-dimensional throughput
graphics, the corresponding titles, units, and tick sizes for x and y-axis; notice that this
particular functionality for common graphs, such as frequency statistics for latency and
throughput, may be provided to the user directly (via an extra user-defined flag).
• a user-defined caption, usually a string describing a particular IP component
instantiated within a user-derived class, e.g. “Mem01_test1” for a statistic test on a
memory named as “Mem01”.
• user-defined log file(s), i.e. for dumping statistical values; these file names may either
be automatically generated from the user-defined caption above, or be provided
explicitly to the GUI.
• a cache_size, as a performance enhancement for the statistics package; this refers to
the maximum number of elements to be stored in the internal table of values before
being transferred to the appropriate GUI or flushed to the file.
• system lock necessary for shared read access to values written by different threads in a
multithreading, different processes in a multiprogramming, and/or different processors
in a parallel or distributed computing environment.
Based on the above description, the C++ implementation of the base_stat class is as
follows.
#define max_stat_params 11
#define max_string_size 90
typedef struct {
double time, value;
} stat_entry; // internal table (StatInstant & StatDuration classes)
typedef struct {
long int object_ptr;
double time, value;
} stat_entry_lib; // internal cache StatDurationLib Class
// for implementing event_driven functions
class base_stat {
public:
// functions defining start/end of statistics collection
void start_stat_time(double _start);
void end_stat_time(double _end);
protected:
double start_time, end_time; //start&end-time for stat collection
unsigned int stat_param_flag; //OR of user-provided boolean flags
unsigned int cache_size; // cache size for dumping data
private:
static unsigned int filename_counter; // unique file identity
static unsigned int counter_id; // count number of instances
unsigned int instance_id; // id of class instance
// In the base class we may also define:
// mutex lock; // to enable shared access to variables
// user-defined log file(s)
// additional user-defined or possibly customized GUI labels
// user-defined captions identifying IP components
};
In event-driven simulation, recorded statistics for events must include an arrival and a
departure time. Since the departure time is known at some point later in time, we propose
an interface which is based on two public stat_event_start and stat_event_end
functions. Thus, the user first invokes
• int a = stat_event_start(double arrival_time)
to record the arrival time and save in variable a the unique location of the event
within the internal table of values. Then, when the event’s departure time is known,
this is recorded within the internal table of values at the correct location, by invoking
• void stat_event_end(double departure_time, int a).
Notice that the event duration can be readily obtained from the above two values. Using
Grace™ this computation can be performed very efficiently using the Edit data
sets: create_new (using Formula) option.
Since complex systems may involve both time-driven (instant) and event-driven
(duration) statistics, we provide the StatInstant and StatDuration classes which
inherit properties from the top-level classes and provide the functionality described
above. The interface for these classes is described below. Implementation is slightly
complicated due to the _time_window parameter. The precise implementation will be
explained in the next release of this manual.
#define MAX_STRING_SIZE 64
class StatInstant : public base_stat {
public:
// constructor for time-driven (instant) stats
StatInstant (unsigned int _cache_size,
unsigned int _time_window);
// class destructor
~StatInstant();
private:
// file pointer used for dumping statistics (ASCII format)
FILE *fp;
// file name for printing statistics
char fname[MAX_STRING_SIZE];
};
class StatDuration : public base_stat {
public:
// class constructor for event-driven (duration) statistics
StatDuration(unsigned int _cache_size,
unsigned int _time_window);
// class destructor
~StatDuration();
private:
// file pointer used for dumping statistics (ASCII format)
FILE *fp;
// file name for printing statistics
char fname[MAX_STRING_SIZE];
};
Finally, notice that for the StatDuration class there is a corresponding internal OCCN
library StatDurationLib class with the same functionality. However, this class
performs the basic stat_event_start/end operations using (unknown to the user)
internal OCCN library object pointers. Thus, stat_event_start/end operations also
take into account memory address in order to compute duration for subsequent, e.g.
read/write accesses to the same memory address.
The StatDurationLib class allows for simple duration statistics based on the
enable_stat_delay OCCN library function for various memory objects providing
read/write access, e.g. FIFO objects, Memory, Cache and inter-module communication
objects.
3.3.1.5 Statistic Classes for Throughput, Latency, Size, and Cell Loss
There are two very common performance metrics for evaluating system performance.
• Throughput refers to the average number of bytes routed through the component per
clock cycle. Sometimes throughput is normalized by the maximum possible
throughput.
• Latency refers to the average delay through the component, measured from arrival
time of the first message bit to departure time of the last message bit. Latency may
include various delays, such as link propagation delay, queuing time, as well as
arbitration time.
As described above, the application user interface for the statistical functions includes
the base classes base_stat and gui_stat, as well as two inherited public classes:
StatInstant and StatDuration.
private:
double bytes_counter;
};
private:
double old_cum_sum;
unsigned int total_no_samples;
};
Using derived classes from StatInstant and the StatDuration class that represent
time- and event-driven statistics, we can derive the application-specific classes instant
throughput (cumulative average over only a time window), cumulative throughput
(cumulative average over many consecutive time windows) and latency statistics:
StatInstantput, StatThroughput and StatDelay. For these classes, we provide
enable functions in order to initialize parameters, such as the absolute start and end time
for statistics collection, the title and legends for the x and y axes, and the time window
for window statistics. We also provide a boolean flag for stopping or resuming statistics
collection during execution.
public:
StatThroughput(unsigned int _cache_size,
unsigned int _time_window);
private:
double bytes_counter; // for computing bytes per sec
};
bool STATS_WANTED_INSTANT_SIZE;
};
bool STATS_WANTED_SIZE;
};
bool STATS_WANTED_INSTANT_PROB;
};
At first, let us define a statistical object for measuring throughput during read access from
a memory module called "stat_memory".
StatThroughput ThroughputRead(1, 1); // cache size 1, time window 1
An example of the use of the object ThroughputRead for statistics collection is shown
below. The values for current time and number of bytes read from the memory module
are recorded with the stat_write function.
if (ThroughputRead.STATS_WANTED_THROUGHPUT)
ThroughputRead.stat_write((N_uint64)
(sc_time_stamp().to_default_time_units()), no_bytes_read);
A statistical object for measuring latency during successive write/read accesses from a
memory module called "stat_memory" may be defined as follows.
StatDelay WriteReadDelay(1, 1); // cache size 1, time window 1
// equivalent representation: StatDuration stat_memory(1000,1);
An example of the use of the object WriteReadDelay for statistics collection is shown
below. Notice the use of the stat_event_start and stat_event_end functions: the
first saves the current time and returns an event id, and the second records the current
time as the departure time for that id.
// step 1: Calling stat_event_start
int id;
if (WriteReadDelay.STATS_WANTED_DELAY)
id = WriteReadDelay.stat_event_start(
(N_uint64)(sc_time_stamp().to_default_time_units()));
// step 2: Calling stat_event_end
if (WriteReadDelay.STATS_WANTED_DELAY)
(void) WriteReadDelay.stat_event_end(
(N_uint64)(sc_time_stamp().to_default_time_units()), id);
For other statistics classes, definition and use of the statistics is similar to the above two
cases. For more details on the use of all statistics classes and the capability of online
statistical graphs, the reader is referred to the corresponding statistical testbenches that
accompany the OCCN library. However, notice that for the StatInstant (and
StatDuration) classes there are no enable functions, and instead the following base
functions are used.
StatInstant s1(1000,1);
// similar for StatDuration
s1.activate_stats(BaseStat::FREQ_STAT_CURR,"str1","uniq_number");
s1.start_sim_time(0);
s1.end_sim_time(100);
// title and legends are defined in the application-dependent
// online visualization function (see statistic testbenches)
In addition to the previously described classes which cover all basic cases, it is
sometimes necessary to combine statistical data from different IP components, e.g. from
two different memory units in order to compare average read vs. write access times. For
this reason, we need new joint or merged statistic classes that inherit from the Instant and
Duration classes. Parameters, e.g. necessary boolean flags, for these joint statistic classes
will be examined in a future issue of this document.
Finally, within the parallel and distributed domain, it is possible that simulation metrics
must be combined together with platform performance indicators which focus on
monitoring system statistics (e.g. simulation speed, computation and communication
load) for improving system performance, e.g. through dynamic load balancing.
Conventional text output and source-code debugging are inadequate for monitoring and
debugging complex and inherently parallel SoC models. Similarly, current tools, such as
the Synopsys Cocentric Studio, can generate VCD files for signal tracing, or build
relational databases in the form of tables, for data recording, visualization, and analysis.
However, extensive SoC performance-modeling environments may be based on the more
advanced monitoring techniques described below.
A status report contains a subset of system state information, including object properties,
such as time stamp and identity. This report can be generated either periodically, i.e.
based on a predetermined FSM or Thread schedule, or on demand, i.e. upon receiving a
request for solicited reporting. Notice that the request may itself be periodic, i.e. via
polling, or on a random basis. Thus, appropriate status reporting criteria define the
reporting scheme, the sampling rate, and the contents of each report.
In order to describe the dynamic behavior of an object or group of objects over a period
of time, status and event reports are recorded in time order, as monitoring traces. A
complete monitoring trace contains all monitoring reports generated by the system since
the start of the monitoring session, while a segmented trace is a sequence of reports
collected during a limited period of time. A trace may be segmented due to overflow of a
trace buffer, or deliberate halting of trace generation that results in the loss, or absence of
reports over a period of time.
Notice that within each trace we need to identify the reporting entity, the monitored
object, and the type of the report, including user-defined parameters, e.g. start- and end-
time, time-window, priority, and size. Thus, user-selected parameters may provide
browsing or querying facilities (by name or content), dynamic adjustment (during
runtime) of event occurrence interval, adjustment of intervals before or between event
occurrences, and examination of the order of event occurrence. A monitoring trace may
also be used to generate non-independent traces based on various logical views of objects
or system activity.
A SoC model may generate large amounts of monitoring information. The data collection
phase is only useful, if the data can be used to identify problems and provide corrective
measures. Thus, monitoring involves four further phases: validation, filtering,
dissemination, and presentation. Validation includes:
• sanity tests based on the validity of individual monitoring traces, e.g. by checking for
correct token values in event fields, such as an identity, or time stamp, and
• validation of monitoring reports against each other, e.g. by checking against known
system properties, including temporal ordering.
Filtering minimizes the amount of monitoring data to a suitable level of detail and rate.
For example, filter mechanisms reduce the complexity of displayed process
communications by
• incorporating a variable report structure,
• displaying processes down to a specified level in the module hierarchy,
• masking communication signals and data using filter dialogs, and
• providing advanced filter functionality for displaying only tokens with
predetermined values.
Monitoring reports must reach human users, managers, and processing entities.
Dissemination schemes range from very simple and fixed, to very complex and
specialized. Selection criteria contained within the subscription request are used by the
dissemination system to determine which reports (and with what contents, format, and
frequency) should be delivered. This requires implicit filtering, since only the requested
reports are forwarded.
A time-process diagram is a 2D diagram illustrating the current system state and the
sequence of events leading to that state. The horizontal axis represents events
corresponding to various processes, while the vertical one represents time. In
synchronous systems, the unit of time corresponds to a period of actual time, while in
asynchronous systems, it corresponds to an occurrence of an event. In the latter case, the
diagram is called concurrency map, with time dependencies between events shown as
arrows. An important advantage of time-process diagrams is that monitoring information
can be presented either on a simple text screen, or on a graphical one.
An animation captures a snapshot of the current system state. Both textual and graphical
event representations, e.g. input, output, and processing events, can be arranged in a 2D
display window. Graphical representations use various formats, such as icons, boxes,
Kiviat diagrams, histograms, bar charts, dials, X-Y plots, matrix views, curves, pie charts,
and performance meters. Subsequent changes in the display occur in single step or
continuous fashion and provide an animated view of system evolution; for online
animation, the effective rates at which monitoring information is produced and presented
to the display must be matched. For each abstraction level, animation parameters include
enable/disable event monitoring or visualization, select clock precision, monitoring
interval, or level of detail, and view or print statistics.
Although the above monitoring techniques are useful, specifications for the actual
graphical user interface of the monitoring tool depend on the imagination, experience,
and available time of the designer. We present here some desirable features that a SoC
monitoring interface should possess.
• Visualization at different abstraction levels enables the user to observe behavior at
various abstraction levels, e.g. per object, per block, or per system. This stepwise
refinement technique allows the user to start observation at a coarse level and
progressively (or simultaneously) focus on lower levels. At each level, application
and system metrics may be presented in appropriate easy-to-read charts and graphs,
while the communication volume among subsystems (or power consumption) may be
visualized by adjusting the width (or color) of the lines interconnecting the
corresponding modules.
• A history function visualizes inherent system parallelism by permitting the user to
Ø scroll the display of events forwards or backwards in time by effectively changing
a simulated system clock, and
Ø control the speed at which system behavior is observed using special functions for
start/stop and pause/restart event display, single-step or continuous animation, and
real-time or slow-motion animation.
• Visibility of interactions enables the user to dynamically visualize contents of a
particular communication message, component data structure, system or network
configuration, or general system data, such as log files, or filtering results.
• Placement of monitoring information greatly enhances visibility and aids human
comprehension. Placement may either be automatic, i.e. using computational
geometry algorithms, or manual, by providing interface functions, e.g. for moving or
resizing display objects.
The following example illustrates the methods discussed for MasterPort and
SlavePort interfaces. It consists of a synchronous transmitter and receiver. This
example counts the number of words in the standard input, separated by space, tab, and
newline characters. The application is implemented using two threads. Both threads use
the new function provided by the port to avoid making copies of messages between thread
and channel.
• The producer thread generates a finite number of input characters and sends them
through a channel to the consumer thread.
• Concurrently, the consumer thread takes characters out of the channel and counts the
number of words until the input stream is exhausted.
The code for the transmitting and receiving thread are provided below.
producer.h
#include "occn.h" // include message definitions
class producer : public sc_module { // class description
public:
producer(sc_module_name name);
MasterPort<Pdu<char> > out; // communication port
SC_HAS_PROCESS(producer);
private:
void read();
};
producer.cc
#include "producer.h"
producer::producer(sc_module_name name) : sc_module(name)
{SC_THREAD(read);}
void producer::read() {
char c;
Pdu<char>* msg;
msg = new(&out) Pdu<char>; // channel allocation
if (!msg)
OCCN_end_sim(-1,"Error during Packet allocation");
while (cin.get(c)) { // producer sends c
*msg = c;
out.send(msg); // after the send msg is not usable
}
delete msg;
}
consumer.h
#include "occn.h"
class consumer : public sc_module { // class description
public:
consumer(sc_module_name name);
SlavePort<Pdu<char> > in; // communication port
SC_HAS_PROCESS(consumer);
private:
void count();
int counter;
};
consumer.cc
#include "consumer.h"
consumer::consumer(sc_module_name name) : sc_module(name),
counter(0)
{SC_THREAD(count);}
void consumer::count() {
bool word_complete=false;
Pdu<char> * msg;
while (1) {
msg = *in.receive(); // receive will allocate the storage
//cout << "consumer receives " << msg->body << endl;
switch (msg->body) {
case ' ' :
case '\t' :
if (word_complete) {
counter++;
word_complete = false;
}
break;
case '\n' :
if (word_complete) counter++;
cout << "Total: "<< counter << " words" << endl;
OCCN_end_sim(0,"SUCCESS");
break;
default:
word_complete = true;
break;
}
in.reply();
}
}
Notice that since the Pdu msg belongs to the transmitter, the send function makes a local
copy before sending the Pdu through the channel. Since local copies are a drawback in
terms of simulation speed, we let the channel manage creation and destruction of Pdus
using the placement feature available in C++ for the standard new function. Therefore we
provide an overload for the operator new that takes as second argument a MasterPort,
using the function void* operator new(size_t nbytes, MasterPort<…>& port).
This function allocates runtime storage returning a pointer to the newly allocated Pdu.
Since in most cases, one caller manages transmission, the new function supports single-
element dynamic allocation for efficient implementation. Thus, for allocating packet
memory within a channel bound to a port called out, we call the operator new.
Pdu<…> * pk = new(&out) Pdu<…>;
Furthermore, notice that the compiler calls the overloaded operator new with parameters
sizeof(Pdu<…>) and out.
Since Pdus belong to the caller, their destruction is the responsibility of the caller.
Thus, the transmitter must destroy the msg using delete. However, in order to
deallocate a previously allocated Pdu<…> object within the channel, the operator delete
is called automatically by the channel before the context switch of the current task.
The standard channel interface (described in “StdChannel.h”) is the most basic channel
for point-to-point, bi-directional inter-module communication. This channel essentially
models circuit switching, since once a module prepares its signals and/or data, this
information always becomes available to the connected module during the next clock
cycle. The user currently has no control to change the number of cycles required for the
transfer. Furthermore, there is no maximum data size on the Pdu to be transmitted
through this channel.
[Figure body: modules My_Module_1 (containing my_prod) and My_Module_2 (containing
my_cons) connected through my_module_3 over the OCCA.]
Figure 16. IP module components: behavior and inter-module communication
Here is an example of top-level module instantiation and port binding with the
StdChannel. The binding represents the architecture shown in Figure 16.
main.cc
#include “occn.h”
#include "producer.h"
#include "consumer.h"
int main() {
// clock, modules, and bus declaration
sc_clock clock1("clk",10,SC_NS);
producer my_prod("Master");
consumer my_cons("Slave");
// channel instantiation
StdChannel<Pdu<char>,Pdu<char> > channel("channel");
// module binding
channel.clk(clock1);
my_prod.out(channel); // port binding (assuming standard sc_port binding)
my_cons.in(channel);
sc_start(); // start simulation
return 0;
}
This Section provides a case study for OCCN methodology, focusing on the user point of
view. This example shows how to develop the Adaptation Layer on top of the basic
OCCN Communication API consisting of the MasterPort and SlavePort classes (see
Figure 11). Using layered OCCN communication architecture, with each layer
performing a well-specified task within the overall protocol stack, we describe a
simplified transport-layer, inter-module transfer application from a transmitter to a
specified receiver. The buffer to be sent is split into frames. Thus, the
TransportLayer API covers the OSI stack model up to the transport layer, since it
includes segmentation. This API consists of the following basic functions.
• void TransportLayer_send(uint addr, BufferType& buffer); The
destination address addr identifies the target receiver and the buffer to be sent. The
BufferType is defined in the inout_pdu.h code block using the OCCN Pdu object.
• BufferType* TransportLayer_receive(); This function returns the received
buffer data.
NoC is becoming sensitive to noise due to technology scaling towards deep submicron
dimensions [5]. Thus, we assume an unreliable channel and implement a simple
stop-and-wait data link protocol with negative acknowledgments, called Automatic
Repeat reQuest (ARQ) [31, 54]. Using this protocol, each frame transmitted by the
sender (I-frame) is acknowledged by the receiver with a separate frame (ACK-frame). A
timeout period determines when the sender has to retransmit a frame not yet
acknowledged.
The I-frame contains a data section (called payload) and a header with various fields:
• a sequence number identifying an order for each frame sequence; this information is
used to deal with frame duplication due to retransmission, or reordering out-of-order
messages due to optimized, probabilistic or hot-potato routing; in the latter case,
messages select different routes towards their destination [28];
• a destination address field related to routing issues at Network Layer;
• an EDC (Error Detection Code) enabling error-checking at the Data Link Layer for
reliable transmission over an unreliable channel [59], and
• a source_id identifying the transmitter for routing back an acknowledgment.
The ACK-frame sent by the receiver consists of only a header, with the following fields:
• a positive or negative ack field that acknowledges packet reception according to the
adopted Data Link protocol, and
• a source_id identifying the receiver where to route the acknowledgment.
A frame is retransmitted only if the receiver informs the sender that this frame is
corrupted through a special error code.
From the transmitter (Tx) side, the ARQ protocol works as follows.
• Tx.1: send an I-frame with a proper identifier in the sequence field.
• Tx.2: wait for acknowledgment from the receiver until a timeout expires;
• Tx.3: if the proper acknowledgment frame is received, then send the next I-frame,
otherwise re-send the same I-frame.
From the receiver (Rx) point of view, the ARQ protocol works as follows.
• Rx.1: wait for an I-frame from the sender;
• Rx.2: check the received frame using the EDC field;
• Rx.3: if the frame is correct, deliver its payload and send back a positive ACK-frame,
otherwise send back an ACK-frame carrying the special error code, requesting
retransmission.
For simplifying our case-study, we assume that no data corruption occurs during the
acknowledgment exchanges. EDC is provided for a future more complex implementation.
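Abstracting away SystemC, timeouts, and the OCCN channel, the combined sender/receiver hand-shake can be sketched in plain C++; the arq_transfer function and its deterministic corruption pattern are illustrative assumptions, not OCCN API:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct IFrame { uint32_t sequence; uint32_t payload; };

// Deliver payloads over an unreliable channel: the receiver treats a frame
// as corrupted whenever the supplied pattern says so (standing in for the
// EDC check), and the sender then retransmits the same I-frame.
std::vector<uint32_t> arq_transfer(const std::vector<uint32_t>& data,
                                   std::deque<bool> corrupted) {
    std::vector<uint32_t> received;
    uint32_t seq = 0;                     // Tx side: next sequence number
    uint32_t expected = 0;                // Rx side: expected sequence number
    std::size_t i = 0;
    while (i < data.size()) {
        IFrame f{seq, data[i]};           // Tx.1: send an I-frame
        bool bad = !corrupted.empty() && corrupted.front();
        if (!corrupted.empty()) corrupted.pop_front();
        if (bad) continue;                // Rx NAK / timeout: retransmit (Tx.3)
        if (f.sequence == expected) {     // Rx.3: in-order frame accepted
            received.push_back(f.payload);
            ++expected;
        }
        ++seq; ++i;                       // ACK received: send next frame
    }
    return received;
}
```

The deque of booleans replaces the random channel noise of StdChannel, which makes the retransmission behavior reproducible in a unit test.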
Figure 17 illustrates the OCCN implementation of our transport layer protocol, using inter-module communication between two SystemC modules (Transmitter and Receiver)
through a synchronous, point-to-point OCCN channel called StdChannel. This channel
implements the timeout capability (see Section 4.1) and random packet loss by emulating
channel noise.
Figure 17. Transport Layer send/receive implementation with OCCN StdChannel
The StdChannel is accessed through the OCCN Communication API defined in Section
3.2, while the Transmitter and Receiver modules implement the higher-level Application
API defined in Section 4.1. This API is based on the Adaptation Layer classes MasterFrame and SlaveFrame, specialized ports derived from MasterPort and SlavePort, respectively
(see Figure 18). This SystemC-compliant approach allows design of the communication-
oriented part of the application on top of the OCCN Communication API.
Figure 18. The MasterFrame and SlaveFrame Adaptation Layer classes
Comments are introduced within each code block to illustrate important design issues. The implementation consists of the following files:
• “inout_pdu.h” – Pdu definitions for all ports,
• “Transmitter.h”/“Transmitter.cc” and “Receiver.h”/“Receiver.cc” – the application modules, and
• “MasterFrame.h”/“MasterFrame.cc” and “SlaveFrame.h”/“SlaveFrame.cc” – the Adaptation Layer interfaces.
With respect to data type definition, we define the buffer and frame data structures using the OCCN Pdu object. The buffer is an OCCN Pdu without a header and with a body of BUFFER_SIZE characters. Mapping of the frame to an OCCN Pdu is shown in Figure 19. For in-order transmission the sequence number could be represented with a single bit. However, a progressive number is assigned to each frame, partly for clarity and partly to support unordered transmissions in the future. Since StdChannel is point-to-point, addr is actually unused, but it could be exploited in a more general setting. Moreover, a reserved number (ERROR_CODE) in the ack field indicates an error; among all non-acknowledgement conditions, it denotes data corruption detected by the EDC.
Figure 19. I- and ACK-frame data structures and corresponding OCCN Pdu objects.
inout_pdu.h
#include <occn.h>
#define BUFFER_SIZE 256
#define ERROR_CODE 0xFFFFFFFF
typedef uint seq_num;
typedef uint EDC_type;
// I_FrameType, ACK_FrameType and BufferType are defined here from the
// OCCN Pdu template, following the frame layout of Figure 19.
To simplify the code, we assume that the data size of the StdChannel equals the frame size, i.e. the Pdu sent through the OCCN Communication API is directly the I-frame in the transmitter-to-receiver direction and the ACK-frame in the opposite one. The “inout_pdu.h” code block provides the Pdu type definitions for inter-module communication. Notice the definitions of I_FrameType and ACK_FrameType, which are used throughout the example.
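As an illustration, plain-C++ stand-ins for these types might look as follows; the real definitions use the OCCN Pdu template and its occn_hdr accessors, so the struct layout below is only an assumption mirroring the fields of Figure 19:

```cpp
#include <cstdint>

typedef uint32_t seq_num;
typedef uint32_t EDC_type;

// Hypothetical stand-in for the Pdu-based I-frame: header fields plus a
// small payload body (4 bytes, matching the case-study's channel width).
struct I_FrameType {
    seq_num  sequence;    // frame ordering / duplicate detection
    uint32_t address;     // destination (Network Layer routing)
    EDC_type EDC;         // error-detection code
    uint32_t source_id;   // where to route the acknowledgment
    char     payload[4];  // body carried by the frame
};

// Hypothetical stand-in for the ACK-frame: header only, no body.
struct ACK_FrameType {
    uint32_t ack;         // sequence acknowledged, or ERROR_CODE
    uint32_t source_id;   // identifies the acknowledging receiver
};
```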
The Transmitter module implements the SC_THREAD action_tx. A buffer filled with random letters in ['A'..'Z'] is sent through the channel by calling the Application API function TransportLayer_send through the MasterFrame sap_tx access port. This operation is repeated NB_SEQUENCES times. The Transmitter module interface described in the code block “Transmitter.h” includes:
• the Pdu definitions, which are the same as in the Receiver module and are thus obtained directly from “inout_pdu.h”,
• the transmission layer name and interface definitions, defined in “MasterFrame.h”,
• the thread name and action (action_tx) routine, and
• other internal objects and variables, such as the buffer.
Transmitter.h
#include <systemc.h>
#include <occn.h>
#include "inout_pdu.h"
#include "MasterFrame.h"
#define NB_SEQUENCES 15
Transmitter.cc
#include <stdlib.h>
#include "transmitter.h"
// Transmitter constructor
transmitter::transmitter(sc_module_name name, sc_time time_out)
:sc_module(name), sap_tx(time_out), buffer_tx()
{SC_THREAD(action_tx);}
The Transmitter module provides a transmission layer interface described in the code blocks “MasterFrame.h” and “MasterFrame.cc”. This layer defines a very simple communication API based on the TransportLayer_send function; its behavior can be summarized into two main actions:
• segmentation of the buffer into I-frames, with the relevant header construction and
insertion; this action exploits Pdu class operators.
• sending the I-frame according to the ARQ protocol.
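The segmentation step can be sketched in plain C++ (the real code uses the OCCN Pdu << and >> operators; the segment helper and the std::string buffer below are illustrative assumptions):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Frame { unsigned sequence; std::string payload; };

// Illustrative segmentation step of TransportLayer_send: split the buffer
// into fixed-size payloads and build one I-frame header (here just the
// sequence number) per segment.
std::vector<Frame> segment(const std::string& buffer, std::size_t frame_size) {
    std::vector<Frame> frames;
    for (std::size_t off = 0, seq = 0; off < buffer.size();
         off += frame_size, ++seq)
        frames.push_back({static_cast<unsigned>(seq),
                          buffer.substr(off, frame_size)});
    return frames;
}
```

Each resulting frame would then be sent, and possibly retransmitted, according to the ARQ protocol above.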
MasterFrame.h
#include <systemc.h>
#include <occn.h>
#include "inout_pdu.h"
MasterFrame.cc
#include "MasterFrame.h"
MasterFrame::MasterFrame(sc_time t) : timeout(t){}
MasterFrame::~MasterFrame() {}
The code block “Receiver.h” describes the interface of the Receiver module.
“Receiver.cc” implements a SC_THREAD process, called action_rx, which reads the
buffer through the API TransportLayer_receive and prints it out. The channel is
accessed through the SlaveFrame sap_rx specialized port.
Receiver.h
#include <systemc.h>
#include <occn.h>
#include "inout_pdu.h"
#include "SlaveFrame.h"
Receiver.cc
#include <stdlib.h>
#include "receiver.h"
receiver::~receiver()
{delete buffer_rx;}
The method that checks for errors using the EDC simply decides randomly whether the frame is correct.
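A minimal sketch of such a random check, assuming an illustrative 10% corruption probability (the actual OCCN Random class and rate may differ):

```cpp
#include <random>

// Sketch of the random EDC check: instead of computing a real error-
// detection code over the frame, declare it corrupted with a fixed
// probability. Returns true when the frame is considered correct.
bool EDC_function_rx_sketch(std::mt19937& rnd, double corruption_prob = 0.1) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rnd) >= corruption_prob;
}
```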
SlaveFrame.h
#include <systemc.h>
#include <occn.h>
#include "inout_pdu.h"
SlaveFrame.cc
#include "SlaveFrame.h"
SlaveFrame::SlaveFrame() {}
SlaveFrame::~SlaveFrame() {}
BufferType* SlaveFrame::TransportLayer_receive() {
  BufferType& buffer_rx = *(new BufferType);
  I_FrameType& frame_rx = *(new I_FrameType);
  ACK_FrameType& frame_ack_tx = *(new ACK_FrameType);
  const seq_num total_frames = buffer_rx.sdu_size / frame_rx.sdu_size;
  Random rnd;
  seq_num sequence_expected = 0;
  do {
    frame_rx = *receive();
    if (EDC_function_rx(frame_rx)) {  // frame_rx is not corrupted
      if (occn_hdr(frame_rx, sequence) == sequence_expected) {
        buffer_rx << frame_rx;
        sequence_expected++;
      }
      occn_hdr(frame_ack_tx, ack) = occn_hdr(frame_rx, sequence);
      occn_hdr(frame_ack_tx, source_id) = occn_hdr(frame_rx, source_id);
    } else {                          // frame_rx is corrupted
      occn_hdr(frame_ack_tx, ack) = ERROR_CODE;
    }
    reply();
    wait(rnd.integer(30), SC_NS);     // random latency of slave
    send(&frame_ack_tx);
  } while (sequence_expected < total_frames);
  return &buffer_rx;
}
Finally, main.cc references all modules; as shown in the code block below, it instantiates the simulation clock, defines the timeout delay for the ARQ protocol, and creates the Transmitter and Receiver modules.
int main() {
sc_clock clock1("clk",10,SC_NS);
sc_time timeout_tx(40, SC_NS); // time out for the ARQ protocol
transmitter my_master("Transmitter", timeout_tx);
receiver my_slave("Receiver");
sc_start(-1);
return -1;
}
This section explains communication refinement when the proprietary STBus NoC is used instead of the generic OCCN StdChannel. The refinement is described after a brief presentation of the STBus, a communication architecture converging from different sources, such as the transputer (ST20), the Chameleon program (ST40, ST50), MPEG video processing, and the VCI (Virtual Component Interface) organization. Today STBus is not only a communication system characterized by protocol, interfaces, transaction set, and IPs, but also a technology allowing the design and implementation of communication networks for SoC. Thus, STBus is supported by a development environment including tools for system-level design, architectural exploration, silicon design, physical implementation, and verification.
There exist three types of STBus protocols, with different complexity and
implementation characteristics and various performance requirements.
• Type 1 is the simplest protocol, intended for peripheral register access. The protocol acts as an RG protocol, without any pipelining. Simple load and store operations of several bytes are supported.
• Type 2 includes pipelining features. It is equivalent to the “basic” RGV protocol. It supports an operation code for ordered transactions. The number of request cells in a packet is the same as the number of response cells.
• Type 3 is an advanced protocol implementing split transactions for high-bandwidth requirements (high-performance systems). It supports out-of-order execution, and the size of a response packet may differ from the size of a request packet.
Each interface maps the STBus transaction set onto a physical set of wires.
Figure 20. The STBus NoC, illustrating initiators and target components (shown as white boxes)
As shown in Figure 20, STBus NoC is built using several components, such as node,
register decoder, type converter, size converter. The node is the core component,
responsible for the arbitration and routing of transactions. The arbitration is performed by
several components implementing various algorithms. A TLM cycle-accurate model for STBus has been developed using OCCN. The model provides all OCCN benefits, such as simplicity, speed, and protocol in-lining. System architects are currently using this model in order to define and validate new architectures, evaluate arbitration algorithms, and discover trade-offs in power consumption, area, clock speed, bus type, request/receive packet size, pipelining (asynchronous/synchronous scheduling, number of stages), FIFO sizes, arbitration schemes (priority, least recently used), latency, and aggregated throughput.
Refinement of the transport layer data transfer case-study is based on the simplest member of the STBus family. STBus Type 1 acts as an RG protocol, involves no pipelining, supports basic load/store operations, and is targeted at modules with low complexity and medium data-rate communication requirements.
Figure 22 shows the simple handshake interface of STBus Type 1 [51, 52]. This interface
supports a limited set of operations based on a packet containing one or more cells at the
interface. Each request cell contains various fields: operation type (opc), position of last
cell in the operation (eop), address of operation (add), data (data) for memory write, and
relevant byte enable signals (be). The request cell is transmitted to the target, which
acknowledges accepting the cell by asserting a handshake (r_req) and sending back a
response packet. This response contains a number of cells, with each cell containing data
(r_data) for read transactions, and optionally error information (r_opc) indicating that a
specific operation is not supported, or that access to an address location within this device
is not allowed. STBus uses r_opc information to diagnose various system errors.
According to Figure 22, the frame must be mapped to an STBus_request Pdu describing
the path from initiator (transmitter) to target (receiver), and an STBus_response Pdu
representing the opposite path. In particular,
• the payload corresponding to STBus write data lines (data) is mapped to the body of
the STBus_request Pdu.
• buffer segmentation and reassembly are implemented as in StdChannel, by exploiting
the >> and << operators of the OCCN Pdu object (the payload size is the same).
• the destination address corresponds to the STBus addr signal (which is a part of the
STBus_request Pdu header), and
• the extra bits for the EDC are implemented as extra lines of the write data path.
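Under these mapping rules, a hypothetical STBus_request and the frame-to-request mapping might be sketched as follows (field names follow the text; widths and the map_frame helper are illustrative assumptions, not the actual STBus Pdu definition):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical shape of the STBus_request Pdu used during refinement: the
// frame's destination address goes to the header addr field, while the
// payload plus the EDC bits travel on the write-data lines.
struct STBus_request {
    uint32_t opc;               // operation type (e.g. a store)
    bool     eop;               // last cell of the operation
    uint32_t addr;              // destination address (request header)
    std::vector<uint8_t> data;  // write data lines: payload + EDC bits
};

STBus_request map_frame(uint32_t dest, const std::vector<uint8_t>& payload,
                        uint8_t edc) {
    STBus_request req{/*opc=*/1u, /*eop=*/true, dest, payload};
    req.data.push_back(edc);    // EDC as extra lines of the write data path
    return req;
}
```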
With the Type 1 protocol, completion of the initiator send means that the target has received
the data. Thus, we avoid explicit implementation of the sequence, source_id (both
fields) and ack fields in Figure 22.
Since there is no direct access to the signal interface or the communication channel characteristics, we do not need to modify the Transmitter or Receiver application modules, or the relevant test benches. Thus, we achieve module and tester design reuse at any level of abstraction, without any rewriting and without ripping up and re-routing communication blocks. This methodology facilitates OCCA design exploration through efficient hardware/software partitioning and testing the effect of various bus architecture features. In addition, OCCN extends the state-of-the-art in communication refinement by presenting the user with a powerful, simple, flexible, and compositional approach that enables rapid IP design and system-level reuse.
The OCCN methodology for collecting statistics from system components can be applied to any modeling object. For advanced statistical data, which may include preprocessing, one may also directly use the public OCCN statistical classes. In order to generate basic statistics information, appropriate enable_stat_* function calls must be made, usually from within the module constructor. For example, we show below the function call for obtaining write throughput statistics (read throughput is similar).
// Enable statistics collection in [0,50000] with number of samples = 1
enable_stat_throughput_read(0, 50000, 1, "Simulation Time",
                            "Average Throughput for Write Access", "buffer");
Considering our transport layer data transfer case-study without taking into account retransmissions, we measure the effective throughput of I-frame transfers in Mbytes/sec. Assuming loss-free transmission, and a receiver that provides an acknowledgment after every clock cycle, i.e. a frame is transmitted every 2 cycles, the StdChannel can transmit a 4-byte payload every two (10 ns) clock cycles. Thus, the maximum throughput is 200 Mbytes/sec.
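This arithmetic can be checked with a small helper (a sketch; the function name and parameterization are ours, not part of the OCCN statistical library):

```cpp
// Peak throughput in Mbytes/sec for a channel that accepts one payload of
// payload_bytes every cycles_per_frame clock cycles of cycle_ns each.
// For the case-study: 4 bytes every 2 x 10 ns cycles -> 200 Mbytes/sec.
double max_throughput_MBps(double payload_bytes, double cycle_ns,
                           double cycles_per_frame) {
    double bytes_per_ns = payload_bytes / (cycle_ns * cycles_per_frame);
    return bytes_per_ns * 1e9 / 1e6;  // bytes/ns -> Mbytes/sec (10^6 bytes)
}
```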
Figure 23 assumes a receiver with random response latency (not greater than 3 clock
cycles), and an unreliable connection. In the left graph, an appropriate timeout is chosen
according to the receiver latency, while in the right graph, the chosen timeout is too short,
thus a high number of retransmissions occur. This obviously decreases the performance
of the adopted ARQ protocol. Observe how graph axes and title are composed from
enable_stat_throughput_read function parameters. Also notice that units and
legends are always provided automatically. Legends list the object name (computed
automatically from the constructor) and time_window. Internal figure numbers are
always included in the corresponding (*.agr) Grace™ file names; this helps in organizing
multiple graphs when preparing a large document for desktop publishing.
Similarly, for obtaining delay statistics, we make the following function call:
enable_stat_delay(0, 50000, "Arrival Time", "Departure Time", "stat_box");
Notice that delay statistics are only meaningful for objects which perform consecutive
write and read operations; these objects essentially serve as transit points: they first
accept information, and then, after a queuing delay, they propagate information. Thus, for
delay statistics, a read access should always occur before the matching write access!
flow now proceeds normally with routing, placement, and optimization by interacting
with various tools, such as Physical Compiler, Chip Architect, and PrimeTime.
The OCCN project is aimed at developing new technology for the design of network on-
chip for next generation high-performance MPSoCs. The OCCN framework focuses on
modeling complex network on-chip by providing a flexible, state-of-the-art, object-
oriented C++-based methodology consisting of an open-source, GNU General Public
Licensed library, built on top of SystemC. OCCN methodology is based on separating
computation from communication, and establishing communication layers, with each
layer translating transaction requests to lower-level communication protocols.
Furthermore, OCCN provides several important modeling features.
• Object-oriented design concepts, fully exploiting advantages of this software
development paradigm.
• Efficient system-level modeling at various levels of abstraction. For example, OCCN
allows modeling of reconfigurable communication systems, e.g. based on
reconfigurable FPGA. In these models, both the channel structure and binding change
during runtime.
• Optimized design based on system modularity, refinement of communication
protocols, and IP reuse principles. Notice that even if we completely change the
internal data representation and implementation semantics of a particular system
module (or communication channel), while keeping a similar external interface, users
can continue to use the module in the same way.
• Reduced model development time and improved simulation speed through powerful
C++ classes.
• System-level debugging using a seamless approach, i.e. the core debugger is able to
send detailed requests to the model, e.g. dump memory, or insert breakpoint.
• Plug-and-play integration and exchange of models with system-level tools supporting SystemC, such as System Studio (Synopsys), NC-Sim (Cadence), and CoWare, making SystemC model reuse a reality.
• Efficient simulation using direct linking with standard, nonproprietary SystemC
versions.
• Early design space exploration for defining the merits of new ideas in OCCA models,
including a generic, reusable, robust, and bug-free C++ statistical library for
exploring system-level performance modeling. In addition, we have discussed
statistical library extensions, including advanced monitoring features, such as
generation, processing, dissemination and presentation.
Low power consumption is one of the crucial factors determining the success of modern, portable electronic systems. Together with the reduction of chip overheating, which negatively affects circuit reliability, it is a main factor driving massive investments in power-conscious design solutions. New methodologies try to address power optimization from the early
stages of the system design, leveraging the degrees of freedom available during
architectural conception and hardware/software partitioning of the system. Thus,
correlated power estimation is performed by using statistics on the bus resource’s usage,
e.g. by collecting data on the number and size of packets sent and received, and keeping
track of the average bus utilization rate. One may use this information together with the
bus topology to figure out the power and energy cost of the system, either in form of
analytic equations or as lookup tables. Power consumption may be modeled at the
Functional level, Transaction level (TLM), Bus Cycle Accurate level (BCA), or All
Cycle Accurate (ACA) level; ACA provides cycle- and pin-accurate level description of
all system modules. ST Microelectronics is currently developing a wide range of power macro-models for STBus, and for the AMBA AHB and APB buses. These models would eventually be hooked to SystemC simulation.
We also hope to develop new methodology and efficient algorithms for automatic design
exploration of high-performance network on-chip. Within this scope, we hope to focus on
system-level performance characterization and power estimation using statistical
(correlated) macros for NoC models, considering topology, data flow, and
communication protocol design. These models would eventually be hooked to the OCCN
framework for SystemC simulation data analysis, and visualization.
While OCCN focuses on NoC modeling, providing important modeling components and
appropriate design methodology, more tools are needed to achieve overall NoC design.
For example, interactive and off-line visualization techniques would enable detailed
performance modeling and analysis, by
• developing advanced monitoring features, such as generation, processing,
dissemination and presentation,
• providing asynchronous statistics classes with the necessary abstract data types to
support waves, concurrency map data structures and system snapshots, and
• combining modeling metrics with platform performance indicators which focus on
monitoring system statistics, e.g. simulation speed, computation and communication
load. These metrics are especially helpful in improving simulation performance in
parallel and distributed platforms, e.g. through automatic data partitioning, or
dynamic load balancing.
References
1. Albonesi, D.H., and Koren, I. “STATS: A framework for microprocessor and system-level design space
exploration”. J. Syst. Arch. , 45, 1999, pp. 1097-1110.
2. Amba Bus, Arm, http://www.arm.com
3. Benini, L., and De Micheli, G. “Networks on chips: A new SoC paradigm”, Computer, vol. 35 (1), 2002, pp. 70—78.
4. Bertozzi, D., Benini , L., and De Micheli, G. “Error control schemes for on-chip interconnection networks:
reliability versus energy efficiency”. Networks on Chip, Eds. A. Jantsch and H. Tenhunen, Kluwer Academic
Publisher, 2003, ISBN: 1-4020-7392-5.
5. Bertozzi, D., Benini, L., and De Micheli, G. “Low power error resilient encoding for on-chip data buses”. Proc. Design Automation & Test in Europe Conf., 2002, pp. 102—109.
6. Brunel J-Y., Kruijtzer W.M., Kenter, H.J. et al. “Cosy communication IP's”. Proc. Design Automation Conf.,
2000, pp. 406-409.
7. Bolsens, I., De Man H.J., Lin, B., van Rompaey, K., Vercauteren, S., and Verkest, D. “Hardware/software co-
design of digital communication systems”. Proc. IEEE, 85(3), 1997, pp. 391—418.
8. Carloni, L.P. and Sangiovanni-Vincentelli, A.L. “Coping with latency in SoC design”. IEEE Micro, Special Issue
on Systems on Chip, Vol. 22-5, 2002, pp. 24—35.
9. Carloni, L.P., McMillan, K.L. and Sangiovanni-Vincentelli, A.L. Theory of latency-insensitive design. IEEE
Trans. Computer-Aided Design of Integrated Circuits & Syst. Vol. 20-9, 2001, pp 1059—1076.
10. Caldari, M., Conti, M., Pieralisi, L., Turchetti, C., Coppola, M., and Curaba, S., “Transaction-level models for
Amba bus architecture using SystemC 2.0”. Proc. Design Automation Conf., Munich, Germany, 2003.
11. Chandy, K. M. and Lamport, L. “Distributed snapshots: determining global states of distributed systems”. ACM
Trans. Comp. Syst., 3 (1), 1985, pp. 63-75.
12. Cristian, F. “Probabilistic clock synchronization”. Distr. Comput., 3, 1989, pp. 146-158.
13. Cierto virtual component co-design (VCC), Cadence Design Systems, see
http://www.cadence.com/articles/vcc.html
14. Coppola, M., Curaba, S., Grammatikakis M.D. and Maruccia, G. “IPSIM: SystemC 3.0 enhancements for
communication refinement”, Proc. Design Automation & Test in Europe Conf., 2003, pp. 106—111.
15. Coppola, M., Curaba, S., Grammatikakis, M., Maruccia, G., and Papariello, F. “The OCCN user manual”. Also
see http://occn.sourceforge.net (downloads to become available soon)
16. Coppola, M., Curaba, S., Grammatikakis, M and Maruccia, G. “ST IPSim reference manual”, internal document,
ST Microelectronics, September 2002.
17. Coppola, M., Curaba, S., Grammatikakis, M. and Maruccia, G. “ST IPSim user manual”, internal document, ST
Microelectronics, September 2002.
18. Dewey, A., Ren, H., Zhang, T. “Behaviour modeling of microelectromechanical systems (MEMS) with statistical
performance variability reduction and sensitivity analysis”. IEEE Trans. Circuits and Systems, 47 (2), 2002, pp.
105—113.
19. Diep, T.A., and Shen J.R. “A visual-based microarchitecture testbench”, IEEE Computer, 28(12), 1995, pp. 57—
64.
20. Dumitras, T., Kerner, S., and Marculescu, R. “Towards on-chip fault-tolerant communication”, Proc. Asia and S. Pacific Design Automation Conf., Kitakyushu, Japan, 2003.
21. De Bernardinis, F., Serge, M. and Lavagno, L. “Developing a methodology for protocol design ”. Research Report
SRC DC324-028, Cadence Berkeley Labs, 1998.
22. Ferrari, A. and Sangiovanni-Vincentelli, A. “System design: traditional concepts and new paradigms”. Proc.
Conf. Computer Design, 1999, pp. 2—13.
23. Fidge C. J.. “Partial orders for parallel debugging”, Proc. ACM Workshop Parallel Distr. Debug., 1988, pp. 183-
194.
24. Forsell, M. “A scalable high-performance computing solution for networks on chips”, IEEE Micro, 22 (5), pp.
46—55, 2002.
25. Gajski, D.D., Zhu, J., Doemer, R., Gerstlauer, A., and Zhao, S. “SpecC: Specification language and methodology”. Kluwer Academic Publishers, 2000. Also see http://www.specc.org
26. Guerrier, P., and Greiner, A. "A generic architecture for on-chip packet-switched interconnections", Proc. Design,
Automation & Test in Europe Conf., 2000, pp. 250—256.
27. Grammatikakis, M.D., and Coppola, M. "Software for multiprocessor networks on chip", Networks on Chip, Eds.
A. Jantsch and H. Tenhunen, Kluwer Academic Publishers, 2003, ISBN: 1-4020-7392-5.
28. Grammatikakis, M.D., Hsu, D.F., and Kraetzl, M. “Parallel System Interconnections and Communications”, CRC Press, 2000, ISBN: 0-849-33153-6.
29. Raghunathan, V., Srivastava, M.B., and Gupta, R.K. “A survey of techniques for energy efficient on-chip communication”. Proc. Design Automation Conf., Anaheim, California, 2003.
30. Haverinen, A., Leclercq, M, Weyrich, N, Wingard, D. “SystemC-based SoC communication modeling for the
OCP protocol”, white paper submitted to SystemC, 2002. Also see http://www.ocpip.org/home
31. Holzmann, G. J. “Design and validation of computer protocols”. Prentice-Hall International Editions, 1991
32. Klindworth, A. “VHDL model for an SRAM”. Report, CS Dept, Uni-Hamburg. See http://tech-
www.informatik.uni-hamburg.de/vhdl/models/sram/sram.html
33. Hunt, G.C., Michael, M., Parthasarathy, S., and Scott, M.L. “An efficient algorithm for concurrent priority queue heaps”. Inf. Proc. Letters, 60 (3), 1996, pp. 151-157.
34. Krolikoski, S., Schirrmeister, F., Salefski, B. Rowson, J., and Martin, G. "Methodology and technology for virtual
component driven hardware/software co-design on the system level", Int. Symp. Circ. and Syst. Orlando, Florida,
1999.
35. “IBM On-chip CoreConnect Bus”. Available from http://www.chips.ibm.com/products/coreconnect
36. Lahiri, K., Raghunathan, A. and Dey, S. “Evaluation of the traffic performance characteristics of SoC
Communication Architectures”, Proc. Conf. VLSI Design, Jan. 2001.
37. Lahiri, K., Raghunathan, A., and Dey, S. "Design space exploration for optimizing on-chip communication
networks", to appear, IEEE Trans. on Computer Aided-Design of Integrated Circuits and Systems.
38. Lahiri, K., Raghunathan, A., and Dey, S. "System level performance analysis for designing on-chip
communication architectures", IEEE Trans. on Computer Aided-Design of Integrated Circuits and Systems, 20 (6),
2001, pp.768-783.
39. Lamport, L. “Time, clocks and the ordering of events in distributed systems”. Comm. ACM, 21 (7), 1978, pp. 558-
564.
40. Michael, M. and Scott, M.L.. “Simple, fast and practical non-blocking and blocking concurrent queue algorithms”.
Proc. ACM Symp. Princ. Distrib. Comput., 1996 , pp. 267-275.
41. Networks on Chip, Eds. Jantsch, A. and Tenhunen, H. Kluwer Academic Publishers, 2003, ISBN: 1-4020-7392-5.
42. Nussbaum, D., and Agarwal, A. “Scalability of parallel machines”. Comm. ACM, 34(3), pp. 56--61, 1991.
43. Paulin, P., Pilkington, C., and Bensoudane E., "StepNP: A system-level exploration platform for network
processors", IEEE Design and Test, 2002, 19 (6), 17-26.
44. Prakash, S., Yann-Hang, L. and Johnson, T. “A non-blocking algorithm for shared queues using compare-and-
swap”. IEEE Trans. Comput., C-43 (5), 1994, pp. 548-559.
45. Poursepanj, A. “The PowerPC performance modeling methodology”. Comm. ACM, 37(6), 1994, pp 47—55.
46. Raw Architecture Workstation. Available from http://www.cag.lcs.mit.edu/raw
47. Rowson, J.A. and Sangiovanni-Vincentelli, A.L. “Interface-based design”. Proc. Design Automation Conf. 1997,
pp. 178–183.
48. Salefski, B., and Martin, G. "System level design of SoC's", Int. Hard. Desc. Lang. Conf., 2000, pp. 3-10. Also
in “System On Chip Methodology and Design Languages”, eds. Ashenden, P.J., Mermet, J.P., and Seepold, R.
Kluwer Academic Publisher, 2001.
49. Selic, B, Gullekson, G., and Ward P.T. “Real-time object-oriented modeling”, J. Wiley & Sons, NY, 1994.
50. Sgroi, M. Sheets, M. Mihal, A. et al. "Addressing system-on-a-chip interconnect woes through communication-
based design". Proc. Design Automation Conf., 2001.
51. Scandurra A., Falconeri, G., Jego, B., “STBus communication system: concepts and definitions”, internal
document, ST Microelectronics, 2002.
52. Scandurra A., “STBus communication system: architecture specification”, internal document, ST
Microelectronics, 2002.
53. SystemC, http://www.systemc.org
54. Tanenbaum, A. “Computer networks”. Prentice-Hall, Englewood Cliffs, NJ, 1999.
55. Turek, J., Shasha, D. and Prakash, S. “Locking without blocking: making lock-based concurrent data structure
algorithms nonblocking”. Proc. ACM Symp. Princ. Database Syst., 1992, pp. 212-222.
56. Turner, J. and Yamanaka, N. "Architectural choices in large scale ATM switches," IEICE Trans. on
Communications, vol. E-81B, Feb. 1998.
57. Verkest, D., Kunkel, J. and Schirrmeister, F. “System level design using C++”. Proc. Design, Automation & Test
in Europe Conf., 2000, pp. 74—83.
58. VSI Alliance, http://www.vsi.org/
59. Wicker, S. “Error control systems for digital communication and storage”, Englewood Cliffs, Prentice Hall, 1995.
60. Zivkovic, V.D., van der Wolf, P., Deprettere, E.F., and de Kock, E.A. “Design space exploration of streaming multiprocessor architectures”, IEEE Workshop on Signal Processing Systems, San Diego, CA, 2002.
61. Zhang, T., Chakrabarty, K., Fair, R.B. “Integrated hierarchical design of microelectrofluidic systems using
SystemC”. Microelectronics J., 33, 2002, pp. 459—470.