Unit 1 Introduction To Embedded System Design
Unit 1: Syllabus
Introduction to Embedded System Design 09 Hrs
Introduction, Characteristics of Embedded Computing Applications, Concept of
Real time Systems, Challenges in Embedded System Design, Design Process:
Requirements, Specifications, Hardware Software Partitioning, System Integration
Embedded System Architecture
Instruction Set Architectures with examples, Memory system Architecture: Von
Neumann, Harvard, caches, Virtual Memory, Memory Management, I/O subsystem:
Busy wait I/O, DMA, Interrupt Driven I/O, Co-Processor & Hardware
Accelerators, Processor performance Enhancement: Pipelining, Superscalar
Execution, Multi Core CPUs, Benchmarking Standards: MIPS, MFLOPS, MMACS,
Coremark
2 MGRJ,ECE,RVCE
[Figure: Block diagram of a computer system (labels: MP, IC)]
Embedded System: Definition
An electronic/electromechanical system which is designed to perform a specific
function and is a combination of both hardware and firmware (software).
The Typical Embedded System
Source: Ref. 2
The Core of the Embedded Systems
The core of the embedded system falls into any one (or more) of the following
categories.
• General Purpose and Domain Specific Processors
• Microprocessors
• Microcontrollers
• Digital Signal Processors
• Programmable Logic Devices (PLDs)
• Application Specific Integrated Circuits (ASICs)
• Commercial off the shelf Components (COTS)
Sensors & Actuators
Sensor:
A transducer device which converts energy from one form to another for
any measurement or control purpose. Sensors act as input devices.
E.g. a Hall effect sensor which measures the distance between
the cushion and the magnet in the Smart running shoes from Adidas.
Actuator:
A form of transducer device (mechanical or electrical) which converts
signals to a corresponding physical action (motion). Actuators act as
output devices.
[Figure: Electronics-enabled “Smart” running shoes from Adidas. Photo courtesy of Adidas (www.adidas.com)]
Embedded Systems Vs General Computing Systems
General Purpose System vs Embedded System
• General Purpose System: A system which is a combination of generic hardware and a General Purpose Operating System for executing a variety of applications.
  Embedded System: A system which is a combination of special purpose hardware and embedded OS for executing a specific set of applications.
• General Purpose System: Contains a General Purpose Operating System (GPOS).
  Embedded System: May or may not contain an operating system for functioning.
• General Purpose System: Applications are alterable (programmable) by the user (it is possible for the end user to re-install the Operating System, and add or remove user applications).
  Embedded System: The firmware of the embedded system is pre-programmed and it is non-alterable by the end user (there may be exceptions for systems supporting OS kernel image flashing through special hardware settings).
Embedded Systems Vs General Computing Systems
General Purpose System vs Embedded System
• General Purpose System: Performance is the key deciding factor in the selection of the system. Always ‘Faster is Better’.
  Embedded System: Application specific requirements (like performance, power requirements, memory usage etc.) are the key deciding factors.
• General Purpose System: Less/not at all tailored towards reduced operating power requirements, options for different levels of power management.
  Embedded System: Highly tailored to take advantage of the power saving modes supported by the hardware and Operating System.
• General Purpose System: Response requirements are not time critical.
  Embedded System: For certain categories of embedded systems, like mission critical systems, the response time requirement is highly critical.
What is this?
Embedded Everywhere !
A Day in The Life Rebrand Final.wmv
Characteristics of Embedded Computing Applications
Application and Domain Specific.
Reactive and Real time
Operates in Harsh environment
Distributed
Small size and Weight
Power Concerns
Compact Systems…
Characteristics…
Application and Domain Specific Systems
Embedded systems are not general-purpose computers.
Optimized for a specific application.
Many of the job characteristics are known before the hardware is designed
which allows the designer to focus on the specific design constraints of a well
defined application.
Embedded S/W usually cannot run on other embedded systems without
modification.
Hardware tailored to an application.
– Unnecessary circuitry is eliminated
– Resources shared if possible.
Characteristics …
Reactive & Real time systems…
One of the biggest challenges for embedded system designers is
performing an accurate worst case design analysis on systems with
statistical performance characteristics (e.g., cache memory on a DSP
or other embedded processor).
Accurately predicting the worst case may be difficult in complicated
architectures.
Real time system operation means the timing behavior of the system
should be deterministic. A real time system should not miss its deadlines.
Ex: Mission control, flight control systems, etc.
Characteristics …
Harsh environment
Many embedded systems do not operate in a controlled
environment.
Excessive heat is often a problem, especially in applications involving
combustion (e.g., Automobile applications).
Protection is needed from vibration, shock, lightning, power supply
fluctuations, water, corrosion, fire, and general physical damage.
The embedded system designer has to accurately model the different
parameters of the harsh real-world environment.
Characteristics …
Small and low weight
Many embedded computers are physically located within some larger system.
Challenge is to develop non-rectangular geometries for certain solutions.
Weight can also be a critical constraint.
Power Concerns
When controlling physical equipment, large current loads may need to be
switched in order to operate motors and other actuators.
Designer must carefully balance system tradeoffs among analog components,
power, mechanical, network, and digital hardware with corresponding software.
Power management needs to be considered in designing an embedded system.
The system should be designed in such a way as to minimize its heat dissipation.
Higher power consumption means shorter battery life.
Characteristics …
Distributed Systems
A set of nodes connected by the network, cooperating to achieve a common goal
- Node: a µC + I/O + communication interface.
- One or multiple networks: wired, wireless.
Ex: Embedded systems in automobiles
Real Time Systems/Services
A real time service is triggered by a real world event and produces a
corresponding system response. How long this transformation of input to
output takes is a key design issue.
Real time services are often implemented by integrating H/W and S/W
components.
Real time systems either poll sensors on a periodic basis, or the sensor
components provide digitized data at a known sampling interval with an
interrupt generated to the controller.
Real time systems are categorized into hard real time and soft real
time systems based on time of completion.
Real Time Systems….
Service Response Timeline
Steps involved in the Design
The Design happens in three steps mainly:
1. Modeling: is the process of gaining a deeper
understanding of a system through imitation. Models
express what a system does or should do.
2. Design : is the structured creation of hardware &
Software. It specifies how a system does what it does.
3. Analysis: is the process of gaining a deeper understanding
of a system through dissection. It specifies why a system
does what it does (or fails to do what a model says it
should do).
Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
Design Process….
Each design step consists of a number of operations.
Modelling
• Requirements
• Specifications
• The product requirements are captured from the customer and converted into system level needs or processing requirements.
• English (or other natural language) is a common starting point.
• Computation models are used to capture the behaviour.
E.g: Sequential program model.
Finite state machine model.
Communicating process model.
Design
• Architecture design
- Architecture Selection:
• Choice of processing elements
• Standard/custom/semi-custom HW
• Memories, interfacing, communication
• Hardware software partitioning
• Hardware software codesign
• Component design
• System Integration
Analysis
• Functionality test
• Objective and closeness functions defined by combining metrics like power, area, etc. are evaluated.
• If the result is not close to the expected value, the design and modelling processes are reiterated.
• In Real Time (RT) systems, timing behaviour must be verified in addition to functional correctness.
Example: An Elevator Controller
Partial English Description
“Move the elevator either up or down to reach the requested floor. Once at the
requested floor, open the door for at least 10 seconds, and keep it open until the
requested floor changes. Ensure the door is never open while moving. Don’t change
directions unless there are no higher requests when moving up or no lower requests
when moving down…”
Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
Elevator Controller…
Simple Elevator Controller
‒Request Resolver resolves various floor requests into
single requested floor.
‒Unit Control moves elevator to this requested floor.
Modelling: Sequential Program Model
Declarations:
Modelling: Finite State Machine(FSM) model
FSM for the UnitControl process.
The FSM model is described by considering systems with:
- Possible states:
E.g: Idle, GoingDown, GoingUp, DoorOpen
- Possible transitions from one state to another based on input
E.g: req > floor
- Actions that occur in each state
E.g: In the GoingUp state, u,d,o,t = 1,0,0,0 (up = 1; down, open, and timer_start = 0)
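The state, transition and action description above can be sketched in C as a next-state function plus an output function. This is only an illustrative sketch: the names `unit_control_next` and `unit_control_outputs`, the `timer_s` input, and every transition and output other than the GoingUp outputs and the `req > floor` condition are assumptions inferred from the partial English description of the elevator, not taken from the slide figure.

```c
#include <stdbool.h>

typedef enum { IDLE, GOING_UP, GOING_DOWN, DOOR_OPEN } State;

/* Outputs u, d, o, t of the UnitControl process. */
typedef struct { bool up, down, open, timer_start; } Outputs;

/* Next-state function (assumed transitions).
 * req = requested floor, floor = current floor,
 * timer_s = seconds elapsed since the door timer was started. */
State unit_control_next(State s, int req, int floor, int timer_s)
{
    switch (s) {
    case IDLE:
        if (req > floor) return GOING_UP;
        if (req < floor) return GOING_DOWN;
        return IDLE;
    case GOING_UP:
        return (req > floor) ? GOING_UP : DOOR_OPEN;
    case GOING_DOWN:
        return (req < floor) ? GOING_DOWN : DOOR_OPEN;
    case DOOR_OPEN:
        /* Keep the door open for at least 10 seconds. */
        return (timer_s < 10) ? DOOR_OPEN : IDLE;
    }
    return IDLE;
}

/* Actions in each state. Only the GoingUp row (1,0,0,0) comes from the
 * text; the other rows are assumptions. */
Outputs unit_control_outputs(State s)
{
    switch (s) {
    case GOING_UP:   return (Outputs){ true,  false, false, false };
    case GOING_DOWN: return (Outputs){ false, true,  false, false };
    case DOOR_OPEN:  return (Outputs){ false, false, true,  true  };
    case IDLE:       break;
    }
    return (Outputs){ false, false, true, false };
}
```

In both moving states the `open` output is 0, which reflects the requirement that the door is never open while moving.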
FSM model…
UnitControl process using a state machine.
Hardware Software Partitioning
Many functions can be done by software on a general purpose microprocessor or by
hardware on application specific ICs (ASICs).
E.g: Game console graphics, PWM, PID control (hardware).
This leads to the hardware/software co-design concept.
Where to place functionality?
E.g: A sort algorithm is faster in hardware, but more expensive;
it is more flexible in software, but slower.
The designer must be able to explore these various trade-offs:
▪ Speed.
▪ Reliability.
▪ Cost.
▪ Form (size, weight, and power constraints).
Hardware Software Partitioning…
Move “bottleneck” computations from software to hardware.
Hardware Implementation
Source: http://class.ece.iastate.edu/cpre488/lectures/Lect-08.pdf
Example:
FIR Filter
Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
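As a rough illustration of why an FIR filter is a typical candidate for moving from software to hardware, here is a minimal software version in C. The function name `fir_filter`, the integer data type and the zero initial condition are assumptions made for this sketch; the slide figure itself is not reproduced here.

```c
#include <stddef.h>

/* Direct-form FIR filter: y[n] = sum over k of h[k] * x[n-k],
 * taking x[m] = 0 for m < 0 (assumed initial condition). */
void fir_filter(const int *h, size_t taps, const int *x, int *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        long acc = 0;
        for (size_t k = 0; k < taps && k <= n; k++) {
            /* One multiply-accumulate (MAC) per tap and per sample:
             * this inner loop is the bottleneck computation. */
            acc += (long)h[k] * x[n - k];
        }
        y[n] = (int)acc;
    }
}
```

The inner loop runs once per tap for every output sample, so in software the cost grows with the filter length, whereas a hardware implementation can perform several MACs in parallel.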
Hardware Software Co-design(FPGA Synthesis)
Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
Tutorial 1
Problem 1
Design an 8051 based HONEY BEE COUNTER with following specifications. The
bees are assumed to enter the bee hive in rectangular box through a small hole.
Another hole is made for the bees to exit. Assume suitable sensors are placed at
entry & exit holes. The system is designed to display the number of bees in hive at
any time. Assume initially there are no bees in hive.
Write block diagram & pseudo code of above system implementation.
Tutorial 1
Problem 2
Design an 8051 based system to control the temperature of a furnace with the following
specifications. The furnace temperature has to be maintained at 30±10C. Connect
suitable sensors & actuators. Display the temperature on an LCD. The power
consumption has to be minimized. Show the design & implementation (diagram
+ program).
Tutorial 1
Problem 3
8051 MCUs are used for the control automation of a chemical plant. One 8051
MCU is used to control the liquid flow of a blast furnace. Another 8051 MCU is
used to control the temperature of the blast furnace. The liquid level & temperature of
the blast furnace are displayed in the master room, powered by another 8051 MCU.
Design a scheme for connecting the above mentioned 8051 MCUs using full duplex
serial communication with variable baud rate. Select any one controller as the master
and assign addresses to the slaves. Draw a block diagram showing the connection of
the controllers. The master communicates with a slave every 10 ms. Write an
ALP for the master to perform this communication.
Hardware Components
Embedded Processor: ISAs
An instruction set, or instruction set architecture (ISA), is the part of the
processor architecture related to programming.
All processors are supported by an instruction set (assembly instructions),
which depends on the organization of the different components in the PLU.
Depending upon the way of supporting different instructions, ISAs are divided into:
- Reduced Instruction Set Computer (RISC)
- Complex Instruction Set Computer (CISC)
Other types of ISAs:
- Very Long Instruction Word (VLIW), etc.
CISC & RISC Design Philosophy: CISC Vs RISC
CISC vs RISC
• CISC: More instructions.
  RISC: Fewer instructions.
• CISC: Instructions are complex to understand.
  RISC: Instructions are easier to understand.
• CISC: Hardware support for many instructions (more silicon usage).
  RISC: Software support for many instructions/operations (less silicon usage).
• CISC: A programmer can achieve the desired functionality with a single instruction, which in turn provides the effect of using several simpler instructions in RISC.
  RISC: The programmer needs to write more code to execute a task, since the instructions are simpler ones.
• CISC: Clock cycles per instruction (CPI) is more.
  RISC: Clock cycles per instruction (CPI) is less.
CISC & RISC Design Philosophy: CISC Vs RISC
CISC vs RISC
• CISC: Code density is more.
  RISC: Code density is less.
• CISC: Fewer registers.
  RISC: More registers.
• CISC: Memory to memory operations are supported.
  RISC: No memory to memory operations are supported.
• CISC: Load & store operations are part of an instruction.
  RISC: Load & store operations are not part of an instruction; they are separate instructions (hence called a load-store architecture).
CISC & RISC Design Philosophy: CISC Vs RISC
CISC vs RISC
• CISC: Non-orthogonal instruction set. Not all instructions are allowed to operate on any register and use any addressing mode; it is instruction specific.
  RISC: Orthogonal instruction set. Allows each instruction to operate on any register and use any addressing mode.
• CISC: Examples: 8086
  RISC: Examples: ARM, MSP 430, PIC, PowerPC
NOTE: In practice, designers are not tied to one architecture (CISC/RISC). Features from
both architectures are mixed to increase performance (increase speed & reduce memory
consumption).
Memory Architecture :Von Neumann & Harvard Architecture
This classification is based on processor architecture design to support memory.
Address Space:
- No. of locations a processor/controller can address.
E.g: 8086: Address bus = 20 bits, so the address space is 1 MB
(00000H-FFFFFH)
8051: Address bus = 16 bits, so the address space is 64 KB
(0000H-FFFFH)
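The two figures above follow from the rule that an n-bit address bus can select 2^n locations. A minimal C sketch of that arithmetic follows; the function names are hypothetical, and byte-wide locations are assumed.

```c
#include <stdint.h>

/* Number of addressable locations for an n-bit address bus: 2^n.
 * Valid for address_bits < 64. */
uint64_t address_space(unsigned address_bits)
{
    return (uint64_t)1 << address_bits;
}

/* Highest address in that space: 2^n - 1. */
uint64_t highest_address(unsigned address_bits)
{
    return address_space(address_bits) - 1;
}
```

For example, `address_space(20)` gives 1,048,576 locations (1 MB with byte-wide locations) and `highest_address(16)` gives 0xFFFF.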
Von Neumann/Princeton Architecture
In this architecture, the address space is shared between program memory & data memory.
E.g: 8086
- Total address space is 1 MB.
- The address space is segmented (shared) into a code
segment (program memory) and a data segment (data memory).
Common memory for program & data.
Single shared bus (address, data & control: the system bus) for instruction and data fetching.
[Figure: Single program/data memory]
Memory Organization
Many of the processors and controllers have memories arranged in some form of
hierarchy.
The fastest memory is physically located near the processor core and the slowest
memory is set further away.
Generally, the closer memory is to the processor core, the more it costs and the
smaller its capacity.
The figure shows a typical memory hierarchy.
[Figure: Typical memory hierarchy from on chip to off chip. Recovered labels: KB, < 1 ns; MB, 10-30 ns (L2 Cache); GB; TB, ~1 ms; ~100 ms. The labels are access times and capacities.]
Source: “ARM System Developer Guide” by Andrew N Sloss
Memory organization…
The registers are internal to the processor core and provide the fastest possible memory
access in the system.
At the primary level, tightly coupled memory (TCM) and level 1 cache are connected
to processor core using dedicated on-chip interfaces.
The TCMs are not subject to eviction (no replacement of contents during program
execution), whereas the cache is subject to eviction; hence a cache access may result
in a miss.
The main memory includes volatile components like SRAM and DRAM, and non-volatile
components like flash memory. The purpose of main memory is to hold programs while
they are running on a system.
The next level is secondary storage: large, slow, relatively inexpensive mass storage
devices such as disk drives or removable memory.
Memory Hierarchy…
Flash Memory
Tutorial 1
Perform capacity planning for a two level memory hierarchy system. The first level, M1 is
a cache with three capacity choices 64 Kbytes, 128 Kbytes and 256 Kbytes. The second
level, M2 is a main memory with a 4 Mbyte capacity. Let C1 and C2 be the cost per byte
and t1 and t2 the access times for M1 and M2 respectively. Assume C1=20C2 and t2=10t1.
The cache hit ratios for the three capacities are assumed to be 0.7, 0.9 and 0.98
respectively.
i) What is the average access time ta in terms of t1=20ns in the three cache designs?
ii) Express the average byte cost of the entire memory hierarchy if C2=$0.2/Kbyte.
iii) Compare the three memory designs and indicate the order of merit in terms of
average costs and average access times respectively.
Choose the optimal design based on the product of average cost & average access times.
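As a starting point for this kind of capacity planning, the two quantities involved can be written as small C helpers. This is only a sketch under stated assumptions: the access-time model t_a = h·t1 + (1 − h)·t2 is one common convention for a two-level hierarchy (others add t1 to the miss path), the cost is a capacity-weighted average, and the function names are hypothetical.

```c
/* Average access time of a two-level hierarchy with hit ratio h in M1.
 * Assumed model: t_a = h * t1 + (1 - h) * t2. */
double avg_access_time(double h, double t1, double t2)
{
    return h * t1 + (1.0 - h) * t2;
}

/* Average cost per byte of the whole hierarchy:
 * c = (c1 * s1 + c2 * s2) / (s1 + s2),
 * where c1, c2 are costs per byte and s1, s2 are capacities in bytes. */
double avg_cost_per_byte(double c1, double s1, double c2, double s2)
{
    return (c1 * s1 + c2 * s2) / (s1 + s2);
}
```

Evaluating both helpers for each cache size and multiplying the results gives the cost × access-time product that the last part of the problem asks for.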
Tutorial 1
Consider a three level memory hierarchy with following specifications:
Design the memory hierarchy to achieve an effective memory access time t = 10.04 µs
with a cache hit ratio h1 = 0.98 and a hit ratio h2 = 0.9 in the main memory. Also, the
total cost of the memory hierarchy is upper bounded by $15,000.
Cache
Cache is a small, fast buffer (SRAM) between processor and memory.
Old values are removed from the cache to make space for new values; the cache works on the
principle of locality.
CPU-Cache interaction:
➢ The transfer unit between the CPU register file and the cache is a 4-byte word.
➢ The transfer unit between the cache and main memory is a 4-word block (16B).
• The tiny, very fast CPU register file has room for four 4-byte words.
• The small fast L1 cache has room for two 4-word blocks.
• The big slow main memory has room for many 4-word blocks.
Source: NPTEL course on “Advanced Computer Architecture” by Dr. John Jose, IIT, Guwahati
Cache Organization
Cache is an array of S sets.
Each set contains one or more lines (E lines per set).
Each line holds a block of B data bytes.
Cache size = S x E x B bytes.
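With this organization, a memory address is usually split into a block offset, a set index and a tag. The sketch below shows one conventional way of doing that split in C; the field layout (offset in the low bits, then set index, then tag) and the names `split_address` and `cache_size_bytes` are assumptions, since the slide's addressing figure is not reproduced here.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* identifies which block is stored in the line */
    uint32_t set;     /* selects one of the S sets */
    uint32_t offset;  /* selects a byte within the B-byte block */
} CacheAddr;

/* Split an address for a cache with num_sets sets (S) and
 * block_bytes bytes per block (B). Both must be non-zero. */
CacheAddr split_address(uint32_t addr, uint32_t num_sets, uint32_t block_bytes)
{
    CacheAddr a;
    a.offset = addr % block_bytes;
    a.set    = (addr / block_bytes) % num_sets;
    a.tag    = addr / (block_bytes * num_sets);
    return a;
}

/* Cache size = S x E x B bytes. */
uint32_t cache_size_bytes(uint32_t S, uint32_t E, uint32_t B)
{
    return S * E * B;
}
```

For example, with S = 4 and B = 16, address 0x1234 falls in set 3 at byte offset 4 with tag 72. When S and B are powers of two, the divisions reduce to shifts and masks in hardware.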
Addressing Caches
Suggested reading:
Direct mapped cache/Set Associative cache
Virtual Memory
Virtual memory is a memory management capability of an OS that uses hardware and
software to allow a computer to compensate for physical memory shortages by
temporarily transferring data from RAM to disk storage.
E.g. Consider a program of 8 pages (a page is typically 4 KB or 8 KB) that resides in secondary
storage as shown below, with a physical memory of 16 pages.
[Figure: Program, disk storage and physical memory]
The MMU creates an illusion that there exists an infinitely large pool of
memory in the system. So, a program whose size is more than the physical memory size
can be executed.
Virtual Memory: How it works?
The Memory Management Unit (MMU), together with the OS, manages the transfer of data between
RAM and disk.
The MMU translates (relocates) each virtual address into a physical address.
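The relocation step can be pictured as a page-table lookup. The following C sketch is a simplified software model, not how any particular MMU is implemented: the single-level table, the 4 KB page size, the 8-page program and the names `PageTableEntry` and `translate` are all assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE  4096u  /* assumed 4 KB pages */
#define NUM_VPAGES 8u     /* assumed 8-page program */

typedef struct {
    bool     present;  /* true if the page is currently in RAM */
    uint32_t frame;    /* physical frame number, valid when present */
} PageTableEntry;

/* Translate a virtual address into a physical address.
 * Returns false on a page fault (page not in RAM or out of range). */
bool translate(const PageTableEntry table[NUM_VPAGES],
               uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;  /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;  /* unchanged by translation */

    if (vpn >= NUM_VPAGES || !table[vpn].present)
        return false;  /* here the OS would bring the page in from disk */

    *paddr = table[vpn].frame * PAGE_SIZE + offset;
    return true;
}
```

Only the page number is remapped; the offset within the page is carried over unchanged.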
IO Subsystem
Embedded systems are interfaced with different IO devices to communicate with and control the
external world.
Data transfer between the peripherals and the CPU may be done in any of three
possible ways:
• Busy wait IO (also called programmed IO or polling).
• Interrupt driven IO.
• Direct Memory Access (DMA) based IO.
Memory-mapped IO allows IO registers to be accessed as memory locations. As a
result, these registers can be accessed using only LOAD and STORE instructions.
IO-mapped IO allows IO to be accessed using separate instructions, such as IN and OUT,
provided by the ISA.
IO Subsystem….
Busy wait IO
It requires constant monitoring by the CPU of the peripheral devices.
A transfer from IO device to memory requires the execution of several
instructions by the CPU, including an input instruction to transfer the data from
device to the CPU and store instruction to transfer the data from CPU to
memory.
The CPU stays in the program loop until the IO unit indicates that it is ready for
data transfer.
Because of the time spent polling whether the IO device is ready, the processor
often cannot perform useful computation.
Busy wait IO..
Example: ADC 0809 Interface to 8051 MCU.
Polling loop: Pseudo Code
void main(void)
{
    /* MCU & ADC initialization */
    while(1)
    {
        /* Poll the ADC until it indicates that data is ready, then read the data */
    }
}
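The busy-wait idea can be sketched in portable C with a simulated, memory-mapped device. This is not the ADC 0809/8051 code from the slide: the register layout `AdcRegs`, the `ADC_READY` flag and the function `adc_read_busy_wait` are hypothetical names chosen for illustration.

```c
#include <stdint.h>

#define ADC_READY 0x01u  /* hypothetical "conversion complete" status bit */

/* Hypothetical register block of a memory-mapped ADC. */
typedef struct {
    volatile uint8_t status;
    volatile uint8_t data;
} AdcRegs;

/* Busy-wait read: the CPU stays in the loop until the device is ready.
 * *polls counts the iterations spent doing no useful work. */
uint8_t adc_read_busy_wait(AdcRegs *adc, unsigned long *polls)
{
    while ((adc->status & ADC_READY) == 0u)
        (*polls)++;                       /* CPU is stuck here, polling */

    adc->status &= (uint8_t)~ADC_READY;   /* acknowledge the sample */
    return adc->data;
}
```

On real hardware `adc` would point at the device's register address; in a host simulation the status bit must be set by something else (or before the call), otherwise the loop never exits, which is exactly the cost of busy waiting.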
IO Subsystem….
Interrupt Driven IO
An interrupt facility and a special interface are used to issue an interrupt request
signal whenever data is available from the IO device, so that the CPU is interrupted
only when there is something to process.
In the meantime, the CPU can proceed with other program execution.
Whenever it is determined that the IO device is ready for data transfer, the
interface (or the device itself) initiates an interrupt request signal to the computer.
The CPU momentarily stops the task that it was performing, branches to
the Interrupt Service Routine (ISR) to process the IO transfer, and then
returns to the task it was originally performing.
Interrupt Driven IO..
Example: ADC 0809 Interface to 8051 MCU. Interrupt Driven IO: Pseudo Code
void main(void)
{
    /* MCU & ADC initialization */
    /* Interrupt initialization */
    while(1)
    {
        /* Other program execution; the ADC data is handled in the ISR */
    }
}
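The division of work between the ISR and the main program can be sketched in portable C. This is a host-side model rather than 8051 code: `adc_isr` stands in for the interrupt service routine (on a real MCU it would be invoked by hardware, here it is called like a normal function), and all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Shared between the ISR and the main program, hence volatile. */
static volatile bool    sample_ready  = false;
static volatile uint8_t latest_sample = 0;

/* Interrupt service routine: keep it short, just capture the data. */
void adc_isr(uint8_t adc_data)
{
    latest_sample = adc_data;
    sample_ready  = true;
}

/* Called from the main loop between other tasks.
 * Returns true and copies the sample if a new one has arrived.
 * On a real MCU this read-and-clear would typically be done with the
 * interrupt briefly masked to avoid a race with the ISR. */
bool take_sample(uint8_t *out)
{
    if (!sample_ready)
        return false;
    *out = latest_sample;
    sample_ready = false;
    return true;
}
```

Unlike the polling loop, the main program never waits on the device: it checks a flag and otherwise continues with other work.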
DMA IO..
Bus Request (HOLD): Used by the DMA controller to request the CPU to
relinquish control of the buses.
Bus Grant (HLDA): Activated by the CPU to inform the external DMA controller
that the buses are in the high impedance state and the requesting DMA controller can take
control of the buses.
Burst Transfer: A block sequence consisting of a number of memory words is transferred in
a continuous burst while the DMA controller is the master of the memory buses.
Cycle Stealing: The DMA controller transfers one word at a time, after which it
must return control of the buses to the CPU. The CPU merely delays its operation for
one memory cycle to allow the direct memory IO transfer to “steal” one memory cycle.
Tutorial 2
Question
Choose a suitable method to support an IO device. Discuss the criteria for selecting the
method.
Suggested Reading:
DMA unit of LPC 1857.
Tutorial 2
IO subsystem: Design IO Subsystem
The LM75A is an industry-standard digital temperature sensor with an integrated
sigma-delta analog-to-digital converter (ADC) and I2C interface. The LM75A
provides 9-bit digital temperature with an accuracy of ±2°C from –25°C to 100°C
and ±3°C over –55°C to 125°C.
Source: LM75A data sheet
Hardware Accelerators & Coprocessors
We can use hardware accelerators and coprocessing to create more efficient, higher
throughput designs.
Hardware accelerators are dedicated fixed-function peripherals designed to
perform a single computationally intensive task over and over.
They offload the main processor, which has a general purpose instruction set,
allowing it to do general-purpose tasks.
Application accelerator is not a new concept.
E.g. the Intel 8087 math coprocessor released in the 1980s.
But it received renewed interest around 2002 due to the single thread
performance stall.
- Frequency scaling became unsustainable with smaller IC feature sizes.
- Instruction-level parallelism (ILP) can go only so far.
Hardware Accelerators..
Analog Devices SHARC® ADSP-2146x
SHARC® ADSP-2146x processor incorporates hardware accelerators for
implementing three widely used signal processing operations: FIR (finite impulse
response), IIR (infinite impulse response), and FFT (fast Fourier transform).
The ADSP-2146x core has a maximum clock rate of 450 MHz. By using SIMD (single-
instruction multiple-data), the core can perform two MAC (multiply-accumulate)
operations per clock cycle for a peak rate of 900 MMAC/sec.
The accelerator in comparison, operates at the clock rate of 225 MHz. Using its four
dedicated MAC units, the FIR accelerator achieves a peak theoretical throughput of
900 MMAC/sec.
Source: White paper on hardware accelerators in SHARC processors by Paul Beckmann, DSP Concepts, LLC.
Analog Devices SHARC® ADSP-2146x
Consider a home theatre system with 7.1 channels of audio at 96 kHz operating at a
block size of 32 samples. Assume that room equalization is being applied by 8 FIR
filters, each 512 points long.
No. of MAC operations: 8 x 512 x 96 kHz ≈ 393 MMAC/sec.
If the core CPU were to perform the filtering, it would take about 44% of a 450 MHz
SHARC processor (393 MMAC/sec out of its 900 MMAC/sec peak).
This FIR processing represents a significant portion of the overall computation of
the CPU and, fortunately, can be offloaded to the accelerator.
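The load estimate above is plain arithmetic and can be reproduced with a small C sketch. The function names are hypothetical; the 900 MMAC/sec peak is the core figure quoted earlier for the ADSP-2146x, and one MAC per filter tap per sample is assumed.

```c
/* MAC rate needed by a bank of FIR filters, in millions of MACs per second.
 * Assumes one MAC per tap per input sample. */
double fir_mmacs(unsigned filters, unsigned taps, double sample_rate_hz)
{
    return (double)filters * (double)taps * sample_rate_hz / 1e6;
}

/* Fraction of the processor's peak MAC rate consumed by that workload. */
double core_load(double required_mmacs, double peak_mmacs)
{
    return required_mmacs / peak_mmacs;
}
```

With 8 filters of 512 taps at 96 kHz, `fir_mmacs` gives 393.216 MMAC/sec, and `core_load(393.216, 900.0)` is about 0.437, i.e. roughly 44% of the core.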
ARM NEON Hardware Accelerators..
Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture
extension for the Arm Cortex-A series and Cortex-R52 processors.
NEON technology is intended to improve the multimedia user experience by accelerating
audio and video encoding/decoding, user interface, 2D/3D graphics or gaming.
NEON instructions allow up to:
Enhancing Performance of Processors
Pipelining
A pipeline is a cascaded connection of processing stages which are connected to
perform a fixed function over a stream of data flowing from one end to the other.
In modern CPUs, the pipelines are applied for instruction execution, arithmetic
computation and memory access operations.
The pipeline is constructed with $k$ processing stages. The processed results are passed
from stage $S_i$ to stage $S_{i+1}$ for all $i = 1, 2, \ldots, k-1$.
[Figure: Linear pipeline, with legend:]
$S_i$ = stage $i$
$L$ = latch
$\tau$ = clock period
$\tau_m$ = maximum stage delay
$d$ = latch delay
Source: Kai Hwang, “Advanced Computer Architecture”, Tata Mcgraw Hill Education.
Pipelining…
Clock Cycle ($\tau$)
$\tau = \max_{1 \le i \le k}\{\tau_i\} + d = \tau_m + d$
Pipeline Frequency or Maximum Throughput
$f = \frac{1}{\tau}$
Ideally, one result is expected to come out of the pipeline per cycle.
However, depending on the initiation rate of successive tasks, the actual throughput of the
pipeline may be lower than $f$.
Speedup ($S_k$)
Ideally, a pipeline with $k$ stages can process $n$ tasks in $k + (n-1)$ clock cycles, where $k$
cycles are needed to complete the execution of the very first task and the remaining $(n-1)$
tasks require $(n-1)$ cycles.
Total time required: $T_k = [k + (n-1)]\tau$
Speedup….
Consider an equivalent-function nonpipelined processor which has a flow-through delay
of $k\tau$.
Total time required: $T_1 = nk\tau$.
Speedup factor: $S_k = \frac{T_1}{T_k} = \frac{nk\tau}{[k + (n-1)]\tau} = \frac{nk}{k + (n-1)}$
Speedup….
However, the number of stages cannot be increased indefinitely because of practical constraints
on cost, implementation complexity, circuit implementation, etc.
The figure shows the optimal number of pipeline stages (performance/cost ratio vs number
of stages).
In practice, most pipelining is staged with $2 \le k \le 15$. Very few pipelines are
designed to exceed 10 stages in real computers.
Speedup….
The efficiency $E_k$ is defined as:
$E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)}$
$E_k \to 1$ as $n \to \infty$ (upper bound)
$E_k = \frac{1}{k}$ when $n = 1$ (lower bound)
The pipeline throughput $H_k$ is defined as the number of tasks performed per unit
time.
$H_k = \frac{n}{[k + (n-1)]\tau} = \frac{nf}{k + (n-1)} = E_k \cdot f$
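The speedup, efficiency and throughput formulas for an ideal k-stage pipeline processing n tasks can be checked numerically with a short C sketch. The function names are hypothetical, and the model is the ideal one from the formulas (no stalls or hazards).

```c
#include <assert.h>

/* Clock cycles needed by a k-stage pipeline for n tasks: k + (n - 1). */
unsigned long pipeline_cycles(unsigned long k, unsigned long n)
{
    assert(k > 0 && n > 0);
    return k + (n - 1);
}

/* Speedup over an equivalent nonpipelined processor: S_k = nk / (k + n - 1). */
double speedup(unsigned long k, unsigned long n)
{
    return (double)(n * k) / (double)pipeline_cycles(k, n);
}

/* Efficiency: E_k = S_k / k = n / (k + n - 1). */
double efficiency(unsigned long k, unsigned long n)
{
    return speedup(k, n) / (double)k;
}

/* Throughput: H_k = E_k * f, with f = 1 / tau (tau = clock period). */
double throughput(unsigned long k, unsigned long n, double tau)
{
    return efficiency(k, n) / tau;
}
```

For k = 4 and n = 64 the pipeline needs 67 cycles and the speedup is 256/67, about 3.82, already close to the limit of k = 4; for n = 1 the speedup is 1 and the efficiency is 1/k.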
Instruction Pipeline
ARM Cortex M3 three stage pipeline:
- The figure below shows 3 stage pipeline of Cortex M3.
- The Fetch stage fetches instructions from memory, presumably one per cycle.
-The Decode stage reveals the instruction function to be performed and
identifies resources needed .
-The instructions are executed in Execute stage.
[Figure: Cortex M3 three stage pipeline timing, cycles 1, 2, 3, 4, 5, …]
Instruction Pipeline….
A seven stage instruction pipeline
- The figure below shows a seven stage pipeline with three Execute (E) stages.
- The Issue (I) stage reserves resources and controls pipeline interlocks.
- The Writeback (W) stage is used to write results back into the registers.
Seven stage Instruction Pipeline….
The following figure shows an improved timing after the instruction issuing order
is changed(out of order execution) to eliminate unnecessary delays due to
dependence.
Enhancing Performance of Processors
Superscalar Execution
In a superscalar execution, multiple instruction pipelines are used. This implies multiple
instructions are issued per cycle and multiple results are generated per cycle.
Superscalar processors are designed to exploit more instruction level parallelism in
user programs.
Only independent instructions can be executed in parallel without causing a wait state.
The amount of instruction level parallelism varies widely depending on type of code
being executed.
The instruction issue degree in a super scalar processor has been limited to 2 to 5 in
practice (Average number of instructions to be executed in parallel is 2 without loop
unrolling).
Superscalar Execution….
Time in cycles
Enhancing Performance of Processors
VLIW architecture
• The Very Long Instruction Word (VLIW) architecture uses more
functional units.
• The CPI of VLIW processor is less compared to CISC & RISC processors.
• 256 or 1024 bits per instruction word.
• Programs are written in conventional short instruction words.
• The code compaction must be done by compiler.
• Instruction parallelism and data movement in a VLIW architecture are
completely specified at the compile time.
VLIW architecture…
Enhancing Performance of Processors
Multi core CPUs
Power and frequency limitations observed on single core implementations have
paved the way for multicore technology.
The clock frequency of single core CPUs is limited in practice to about 4 GHz, as any
increase beyond this frequency sharply increases power dissipation.
A multi-core processor is typically a single processor which contains several
cores on a chip.
The individual cores on a multi-core processor don’t necessarily run as fast as the
highest performing single-core processors, but they improve overall performance by
handling more tasks in parallel.
Multi core CPUs..
The multiple cores inside the chip are not clocked at a higher frequency; instead, their
capability to execute programs in parallel contributes to the overall
performance, making them more energy efficient, low power cores, as shown in the
figure below.
Multi-core processors can also be implemented as a combination of
homogeneous and heterogeneous cores.
In a homogeneous core architecture, all the cores in the CPU are identical, and they
improve the overall processor performance by breaking up a computationally intensive
application into less computationally intensive applications and executing them in parallel.
E.g: AMD dual cores, Intel Core2 Duo and Quad cores.
Source: Intel Higher Education Program & FAER.
Multi core CPUs..
Challenges with multicores:
CPU Benchmarking standards
MIPS (Million Instructions Per Second)
$MIPS = \frac{\text{Clock Frequency}}{CPI \times 1{,}000{,}000}$ (clock frequency in Hz)
MIPS is only an approximation of a processor's performance, because some
instructions do more work than others.
A computer rated at 100 MIPS may be able to compute certain values faster than
another computer rated at 120 MIPS.
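The MIPS formula is simple enough to check with a one-line C helper. The function name `mips_rating` is hypothetical, and CPI is taken as an average over the executed instruction mix.

```c
/* MIPS = clock frequency (Hz) / (CPI * 1,000,000). */
double mips_rating(double clock_hz, double cpi)
{
    return clock_hz / (cpi * 1e6);
}
```

For example, a 100 MHz processor with an average CPI of 2 is rated at 50 MIPS, and halving the CPI at the same clock doubles the rating.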
Dhrystone MIPS: Dhrystone is a standard program consisting of arithmetic &
logical operations on integers and is used to benchmark CPU.
MIPS…
Tutorial 4
The execution times (in seconds) of three programs on three MCUs are given
below:
CPU Benchmarking standards
MFLOPS(Mega Floating Point Operations per Second)