Unit 1 Introduction To Embedded System Design


Unit 1

Unit 1: Syllabus
 Introduction to Embedded System Design 09 Hrs
Introduction, Characteristics of Embedding Computing Applications, Concept of
Real time Systems, Challenges in Embedded System Design, Design Process:
Requirements, Specifications, Hardware Software Partitioning, System Integration
 Embedded System Architecture
Instruction Set Architectures with examples, Memory system Architecture: Von
Neumann, Harvard, caches, Virtual Memory, Memory Management, I/O
subsystem: Busy wait I/O, DMA, Interrupt Driven I/O, Co-Processor & Hardware
Accelerators, Processor performance Enhancement: Pipelining, Superscalar
Execution, Multi Core CPUs, Benchmarking Standards: MIPS, MFLOPS, MMACS,
Coremark

2 MGRJ,ECE,RVCE
Block diagram of a
computer system

[Figure: microprocessor and memory/I/O ICs interconnected by buses]

 Bus: Collection of wires


3 MGRJ,ECE,RVCE
Source: T.L Floyd, “Digital Fundamentals”, 9e
Block diagram of a computer
 All computer systems consist of basic functional blocks that include a
CPU, memory and input/output ports.
 These blocks are connected by three internal buses: the address
bus, control bus and data bus, collectively called the system bus.
 A port is a physical interface on a computer through which data is passed
to and from peripherals.
 The memory includes program memory (ROM), which stores the instructions to
be executed to solve a specific problem, and data memory, which stores data
during instruction execution.

4 MGRJ,ECE,RVCE
Embedded System: Definition
An Electronic/Electro-mechanical system which is designed to perform a specific
function and is a combination of both hardware and firmware (software).

E.g. Electronic Toys, Mobile Handsets, Washing Machines, Air Conditioners,
Automotive Control Units, Set Top Boxes, DVD Players, etc.

5 MGRJ, ECE
The Typical Embedded System

6 MGRJ, ECE
Source: Ref. 2
The Core of the Embedded Systems
-The core of the embedded system falls into one (or more) of the following
categories.
• General Purpose and Domain Specific Processors
• Microprocessors
• Microcontrollers
• Digital Signal Processors
• Programmable Logic Devices (PLDs)
• Application Specific Integrated Circuits (ASICs)
• Commercial off the shelf Components (COTS)

7 MGRJ, ECE
Sensors & Actuators
Sensor:
A transducer device which converts energy from one form to another for
measurement or control purposes. A sensor acts as an input device.
E.g. the Hall effect sensor which measures the distance between the cushion
and the magnet in the smart running shoes from Adidas.

Actuator:
A form of transducer device (mechanical or electrical) which converts
signals into a corresponding physical action (motion). An actuator acts as an
output device.
E.g. the micro motor actuator which adjusts the position of the cushioning
element in the smart running shoes from Adidas.

[Photo: electronics-enabled “smart” running shoes from Adidas; photo courtesy of Adidas (www.adidas.com)]

8 MGRJ, ECE
Communication Interface
• A communication interface is essential for communicating with the various subsystems
of the embedded system and with the external world.
• Serial interfaces like I2C, SPI, UART, 1-Wire, etc., and parallel bus interfaces are
examples of ‘Onboard Communication Interfaces’.
• The ‘Product level communication interface’ (External Communication
Interface) is responsible for data transfer between the embedded system and
other devices or modules.
• Infrared (IR), Bluetooth (BT), Wireless LAN (Wi-Fi), Radio Frequency (RF), etc. are
examples of wireless external communication interfaces.
• RS-232C/RS-422/RS-485, USB, Ethernet (TCP/IP), IEEE 1394 port, Parallel
port, etc. are examples of wired interfaces.
9 MGRJ, ECE
Embedded Firmware/Software
• The control algorithm (program instructions) and the
configuration settings that an embedded system developer dumps
into the code (program) memory of the embedded system.
• The embedded firmware can be developed in various ways, e.g.:
-Write the program in a high level language.
-Write the program in Assembly Language.

10 MGRJ, ECE
Embedded Systems Vs General Computing Systems
General Purpose System:
 A system which is a combination of generic hardware and a General Purpose
Operating System, used for executing a variety of applications.
 Contains a General Purpose Operating System (GPOS).
 Applications are alterable (programmable) by the user (it is possible for the end
user to re-install the Operating System and to add or remove user applications).

Embedded System:
 A system which is a combination of special purpose hardware and an embedded
OS, used for executing a specific set of applications.
 May or may not contain an operating system for functioning.
 The firmware of the embedded system is pre-programmed and non-alterable by
the end user (there may be exceptions for systems supporting OS kernel image
flashing through special hardware settings).

11 MGRJ, ECE
Embedded Systems Vs General Computing Systems
General Purpose System:
• Performance is the key deciding factor in the selection of the system; always
‘faster is better’.
• Less/not at all tailored towards reduced operating power requirements or options
for different levels of power management.
• Response requirements are not time critical.

Embedded System:
 Application specific requirements (like performance, power requirements,
memory usage etc.) are the key deciding factors.
 Highly tailored to take advantage of the power saving modes supported by the
hardware and Operating System.
 For certain categories of embedded systems, like mission critical systems, the
response time requirement is highly critical.

12 MGRJ, ECE
Embedded Systems Vs General Computing Systems

General Purpose System:
• Need not be deterministic in execution behaviour.

Embedded System:
 Execution behaviour is deterministic for certain types of embedded systems, like
‘Hard Real Time’ systems.

13 MGRJ, ECE
What is this?

14 MGRJ,ECE,RVCE
Embedded Everywhere !
 A Day in The Life Rebrand Final.wmv

15 MGRJ,ECE,RVCE
Characteristics of Embedded Computing Applications
 Application and Domain Specific.
 Reactive and Real time
 Operates in Harsh environment
 Distributed
 Small size and Weight
 Power Concerns
 Compact Systems…

16 MGRJ,ECE,RVCE
Characteristics…
 Application and Domain Specific Systems
 Embedded systems are not general-purpose computers.
 Optimized for a specific application.
 Many of the job characteristics are known before the hardware is designed,
which allows the designer to focus on the specific design constraints of a well-
defined application.
 Embedded S/W usually cannot run on other embedded systems without
modification.
 Hardware tailored to an application.
– Unnecessary circuitry is eliminated
– Resources shared if possible.

17 MGRJ, ECE
Characteristics …

 Reactive & Real time Systems


 A typical embedded system responds to the environment via sensors
and controls the environment using actuators.
 Embedded systems are expected to run at the speed of the
environment; this characteristic is called “reactive”.
 Reactive computation means that the system executes in response to
external events.
 External events can be either periodic or aperiodic.

18 MGRJ, ECE
Reactive & Real time systems…
 One of the biggest challenges for embedded system designers is,
performing an accurate worst case design analysis on systems with
statistical performance characteristics (e.g., cache memory on a DSP
or other embedded processor).
 Accurately predicting the worst case may be difficult in complicated
architectures.
 Real time system operation means the timing behaviour of the system
should be deterministic. A real time system should not miss its deadline.
Ex: Mission control, flight control systems, etc…

19 MGRJ, ECE
Characteristics …
 Harsh environment
 Many embedded systems do not operate in a controlled
environment.
 Excessive heat is often a problem, especially in applications involving
combustion (e.g., automotive applications).
 Protection is needed from vibration, shock, lightning, power supply
fluctuations, water, corrosion, fire, and general physical damage.
 A challenge for the embedded system designer is to accurately model the
different parameters of the harsh real-world environment.

20 MGRJ, ECE
Characteristics …
 Small and low weight
 Many embedded computers are physically located within some larger system.
 Challenge is to develop non-rectangular geometries for certain solutions.
 Weight can also be a critical constraint.

 Power Concerns
 When controlling physical equipment, large current loads may need to be
switched in order to operate motors and other actuators.
 The designer must carefully balance system tradeoffs among analog components,
power, mechanical, network, and digital hardware with the corresponding software.
 Power management needs to be considered in designing an embedded system.
The system should be designed in such a way as to minimize the heat dissipated by
the system. More power consumption means less battery life.
21 MGRJ, ECE
Characteristics …
 Distributed Systems
 A set of nodes connected by a network, cooperating to achieve a common goal.
- Node: a µC + I/O + communication interface.
- One or multiple networks: wired, wireless.
Ex: Embedded systems in automobiles

22 MGRJ,ECE,RVCE
Real Time Systems/Services
 A real time service is triggered by a real world event and produces a
corresponding system response; how long this transformation of input to
output takes is a key design issue.
 Real time services are often implemented by integrating H/W and S/W
components.
 Real time systems either poll sensors on a periodic basis, or the sensor
components provide digitized data at a known sampling interval with an
interrupt generated to the controller.
 Real time systems are categorized into Hard real time and Soft real
time systems based on the time of completion.

23 MGRJ, ECE
Real Time Systems….
Service Response Timeline

24 MGR,ECE,RVCE
Steps involved in the Design
The design happens mainly in three steps:
1. Modeling: is the process of gaining a deeper
understanding of a system through imitation. Models
express what a system does or should do.
2. Design : is the structured creation of hardware &
Software. It specifies how a system does what it does.
3. Analysis: is the process of gaining a deeper understanding
of a system through dissection. It specifies why a system
does what it does (or fails to do what a model says it
should do).

MGRJ, ECE Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
25
Design Process….
 Each design step consists of a number of operations.

Modelling:
• Requirements: the product requirements are captured from the customer and
converted into system level needs or processing requirements. English (or another
natural language) is a common starting point.
• Specifications: computation models are used to capture the behaviour.
E.g: Sequential program model, Finite state machine model, Communicating
process model.

Design:
• Architecture design and architecture selection: choice of processing elements
(standard/custom/semi-custom HW), memories, interfacing, communication.
• Hardware software partitioning and hardware software codesign.
• Component design.
• System integration.

Analysis:
• Functionality test.
• Objective and closeness functions, defined by combining metrics like power,
area, etc., are evaluated. If the result is not close to the expected value, the
design and modelling processes are reiterated.
• In Real Time (RT) systems, timing behaviour must be verified in addition to
functional correctness.

26 MGRJ,ECE,RVCE
Example: An Elevator Controller
 Partial English Description
“Move the elevator either up or down to reach the requested floor. Once at the
requested floor, open the door for at least 10 seconds, and keep it open until the
requested floor changes. Ensure the door is never open while moving. Don’t change
directions unless there are no higher requests when moving up or no lower requests
when moving down…”

Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
27 MGRJ, ECE
Elevator Controller…
 Simple Elevator Controller
‒Request Resolver resolves the various floor requests into a
single requested floor.
‒Unit Control moves the elevator to this requested floor.

28 MGRJ, ECE
Modelling: Sequential Program Model
Declarations:

29 MGRJ, ECE
Modelling: Finite State Machine(FSM) model
 FSM for the UnitControl process.
 The FSM model is described by considering systems with:
- Possible states:
E.g: Idle, GoingDown, GoingUp, DoorOpen
- Possible transitions from one state to another based on inputs
E.g: req > floor
- Actions that occur in each state
E.g: In the GoingUp state, u,d,o,t = 1,0,0,0 (up = 1; down, open, and timer_start = 0).
A minimal C sketch of this FSM is given below.
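A minimal C sketch of the UnitControl FSM described above; the state names follow the slide,
while the helper variables (req, floor, timer_expired) and output flags (u, d, o, t) are illustrative
assumptions, not taken from the original figure:

    /* Hedged sketch of the UnitControl state machine; names are illustrative. */
    typedef enum { IDLE, GOING_UP, GOING_DOWN, DOOR_OPEN } state_t;

    void unit_control_step(state_t *s, int req, int floor, int timer_expired,
                           int *u, int *d, int *o, int *t)
    {
        switch (*s) {
        case IDLE:
            *u = 0; *d = 0; *o = 1; *t = 0;            /* door open while idle   */
            if (req > floor)      *s = GOING_UP;
            else if (req < floor) *s = GOING_DOWN;
            break;
        case GOING_UP:
            *u = 1; *d = 0; *o = 0; *t = 0;            /* move up, door closed   */
            if (req == floor) *s = DOOR_OPEN;
            break;
        case GOING_DOWN:
            *u = 0; *d = 1; *o = 0; *t = 0;            /* move down, door closed */
            if (req == floor) *s = DOOR_OPEN;
            break;
        case DOOR_OPEN:
            *u = 0; *d = 0; *o = 1; *t = 1;            /* start the 10 s timer   */
            if (timer_expired && req != floor) *s = IDLE;
            break;
        }
    }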

30 MGRJ, ECE
FSM model…
 UnitControl process using a state machine.

31 MGRJ, ECE
Hardware Software Partitioning
 Many functions can be done by software on a general purpose microprocessor OR by
hardware on an application specific ICs (ASICs)
E.g: Game console graphic, PWM, PID control(Hardware).
 Leads to Hardware/Software Co-design concept.
 Where to place functionality?
E.g: A sort algorithm is faster in hardware, but more expensive;
more flexible in software, but slower.
 Designer must be able to explore these various trade-offs:
▪ Speed.
▪ Reliability.
▪ Cost
▪ Form (size, weight, and power constraints.)
32 MGRJ, ECE
Hardware Software Partitioning…
 Move “bottleneck” computations from software to hardware.

Hardware Implementation

33 MGRJ, ECE
Source: http://class.ece.iastate.edu/cpre488/lectures/Lect-08.pdf
Example:
FIR Filter
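As a reference for the kind of computation typically partitioned in this example, a minimal C
sketch of an N-tap FIR inner loop is given below (the figure from the source slide is not
reproduced; function and variable names are illustrative). The multiply-accumulate loop is the
"bottleneck" candidate for moving into hardware:

    /* y[n] = sum over k of h[k] * x[n-k]; x_history holds the last N input samples. */
    float fir_sample(const float *h, const float *x_history, int N)
    {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += h[k] * x_history[k];   /* one MAC per tap */
        return acc;
    }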

MGRJ, ECE Source: http://class.ece.iastate.edu/cpre488/lectures/Lect-08.pdf


34
Hardware Software Partitioning…

Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
35 MGRJ, ECE
Hardware Software Co-design(FPGA Synthesis)

Source: NPTEL Course “Embedded System Design” by Arnab Sarkar, IIT, Guwahati
36 MGRJ, ECE
Tutorial 1
Problem 1
Design an 8051 based HONEY BEE COUNTER with the following specifications. The
bees are assumed to enter the bee hive (a rectangular box) through a small hole.
Another hole is made for the bees to exit. Assume suitable sensors are placed at the
entry & exit holes. The system is designed to display the number of bees in the hive at
any time. Assume initially there are no bees in the hive.
Draw the block diagram & write pseudo code for the above system implementation.

37 MGRJ, ECE
Tutorial 1

Problem 2
Design an 8051 based system to control the temperature of a furnace with the following
specifications. The furnace temperature has to be maintained at 30±10C. Connect
suitable sensors & actuators. Display the temperature on an LCD. The power
consumption has to be minimized. Show the design & implementation (diagram
+ program).

38 MGRJ, ECE
Tutorial 1

Problem3
8051 MCUs are used for the control automation of a chemical plant. One 8051
MCU is used to control the liquid flow of a blast furnace. Another 8051 MCU is
used to control the temperature of the blast furnace. The liquid level & temperature of
the blast furnace are displayed in the master room, powered by another 8051 MCU.
Design a scheme for connecting the above mentioned 8051 MCUs using full duplex
serial communication with a variable baud rate. Select any one controller as the master
and assign addresses to the slaves. Draw a block diagram showing the connection of the
controllers. It is given that the master communicates with a slave every 10 ms. Write
an ALP for the master to perform this communication.

39 MGRJ, ECE
Hardware Components
Embedded Processor: ISAs
 An instruction set, or instruction set architecture (ISA), is the part of the
processor architecture related to programming.
 All processors are supported by an instruction set (assembly instructions),
which depends on the organization of the different components in the processor.
 Depending upon the way different instructions are supported, ISAs are divided into
-Reduced Instruction Set Computer (RISC)
-Complex Instruction Set Computer (CISC)
 Other types of ISAs
-Very Long Instruction Word (VLIW), etc.

40 MGRJ, ECE
CISC & RISC Design Philosophy: CISC Vs RISC
CISC:
 More instructions.
 Instructions are complex to understand.
• Hardware support for many instructions (more silicon usage): a programmer can
achieve the desired functionality with a single instruction, which provides the effect
of using several simpler instructions in RISC.
• Clock cycles per instruction (CPI) is higher.

RISC:
 Fewer instructions.
 Instructions are easier to understand.
• Software support for many instructions/operations (less silicon usage): the
programmer needs to write more code to execute a task since the instructions are
simpler ones.
 Clock cycles per instruction (CPI) is lower.
41 MGRJ,ECE,RVCE
CISC & RISC Design Philosophy: CISC Vs RISC
CISC:
 Code density is higher.
 Fewer registers.
 Memory to memory operations are supported: load & store operations can be part
of an instruction.
• More addressing modes.
• Variable length instructions.
• Design of pipelining is complex.

RISC:
 Code density is lower.
 More registers.
 No memory to memory operations: load & store are separate instructions
(hence called a load-store architecture).
• Fewer addressing modes.
• Fixed length instructions.
• Design of pipelining is easier.

42 MGRJ,ECE,RVCE
CISC & RISC Design Philosophy: CISC Vs RISC
CISC:
 Non-orthogonal instruction set: not all instructions are allowed to operate on any
register or use any addressing mode; it is instruction specific.
• Examples: 8086

RISC:
 Orthogonal instruction set: each instruction is allowed to operate on any register
and use any addressing mode.
• Examples: ARM, MSP430, PIC, PowerPC

NOTE: In practice, designers are not strict about the architecture (CISC/RISC); features
from both architectures are mixed to increase performance (increase speed & reduce
memory consumption).

43 MGRJ,ECE,RVCE
Memory Architecture :Von Neumann & Harvard Architecture
 This classification is based on processor architecture design to support memory.
 Address Space:
- The number of locations a processor/controller can address.
E.g: 8086: Address bus = 20 bits, so the address space is 1 MB
(00000H-FFFFFH)
8051: Address bus = 16 bits, so the address space is 64 KB
(0000H-FFFFH)

44 MGRJ,ECE,RVCE
Von Neumann/Princeton Architecture
 In this architecture, the address space is shared between program memory & data memory.
 E.g: 8086
- The total address space is 1 MB.
- The address space is segmented (shared) into code
segments (program memory) and data segments (data memory).
 Common memory for program & data.
 Single shared bus (address, data & control: the system bus) for instruction and data fetching.

[Figure: CPU connected to a single program/data memory over the shared system bus]

 The speed of execution is lower because of the sharing of the bus.
 The complexity of the processor design is less because of the single bus.

45
Harvard Architecture
 In this architecture, the address space is not shared between program memory & data
memory.
 E.g: 8051
- The total address space for program memory is 64 KB & for
data memory is 64 KB.
- Program memory & data memory locations are separate.
 Separate memory for program & data.
 Separate buses for instruction and data fetching.

 The speed of execution is higher because of the separate buses.

 The processor design is more complex.
46 MGRJ,ECE,RVCE
Memory Map of
LPC 1857 MCU

Source: User Manual

47 MGRJ, ECE
Memory Organization
 Many of the processors and controllers have memories arranged in some form of
hierarchy.
 The fastest memory is physically located near the processor core and the slowest
memory is set further away.
 Generally, the closer memory is to the processor core, the more it costs and the
smaller its capacity.
 The figure shows a typical memory hierarchy.

[Figure: memory hierarchy — on-chip registers, L1/L2 caches and main memory through off-chip
secondary storage; capacities grow from KB to TB while access times grow from < 1 ns to ~100 ms
moving away from the core]

48 MGRJ,ECE,RVCE Source: “ARM System Developer Guide” by Andrew N Sloss
Memory organization…
 The registers are internal to the processor core and provide the fastest possible memory
access in the system.
 At the primary level, tightly coupled memory (TCM) and the level 1 cache are connected
to the processor core using dedicated on-chip interfaces.
 The TCMs are not subject to eviction (no replacement of contents during program
execution) while the cache is subject to eviction, hence a cache access may result in a
miss.
 The main memory includes volatile components like SRAM and DRAM, and non-volatile
components like flash memory. The purpose of main memory is to hold programs while
they are running on a system.
 The next level is secondary storage: large, slow, relatively inexpensive mass storage
devices such as disk drives or removable memory.

49 MGRJ,ECE,RVCE
Memory Hierarchy…

 The memory hierarchy is developed based on a program behaviour known as locality
of references.
 Spatial locality: the probability of the processor accessing locations adjacent to the
current access is high.
E.g: If the processor accesses a location X at time instant t, then at future time instants
(t+1, t+2, and so on) the probability of the processor accessing locations X+1, X+2 and
so on is high.
 Temporal locality: if the processor is accessing a location, the probability of the
processor accessing the same location at future time instants is high.
E.g: If the processor accesses a location X at time instant t, then at future time instants
(t+1, t+2, and so on) the probability of the processor accessing location X again is high.
A simple loop illustrating both kinds of locality is sketched below.
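A minimal C sketch (not from the slide) showing both kinds of locality in an ordinary
array-sum loop:

    /* Sums n integers; illustrates spatial and temporal locality. */
    int sum_array(const int *a, int n)
    {
        int sum = 0;                 /* 'sum' and 'i' are reused every iteration: temporal locality */
        for (int i = 0; i < n; i++)
            sum += a[i];             /* consecutive addresses a[0], a[1], ...: spatial locality     */
        return sum;
    }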

50 MGRJ,ECE,RVCE
Memory Hierarchy…

 Typical Access Patterns

Source: NPTEL course on “Advanced Computer Architecture” by Dr. John Jose,IIT,Guhawati


51 MGRJ,ECE,RVCE
Memory Hierarchy Fundamentals
 Units of data transfer:
 Block: a larger data unit consisting of several bytes.
 Page: a combination of several blocks (the unit of transfer with flash memory/disk).
 Hit: if the data referenced by the processor is present at a level, it is a hit. Otherwise, the data
access is a miss.
 Hit Time: time to access the cache memory block and return the data to the processor.
 Hit Rate / Miss Rate: fraction of memory accesses found (not found) in the level.
 Miss Penalty: time to replace a block in the level with the corresponding block from the next
level.
 Avg. Memory Access Time (with L1 cache):
AMAT = hit time of L1 + miss-rate * miss-penalty
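A minimal C sketch of this formula with illustrative numbers (hit time 1 ns, miss rate 5%,
miss penalty 20 ns are assumptions for this example only, not values from the slide):

    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 1.0;   /* ns, L1 hit time                  */
        double miss_rate    = 0.05;  /* 5% of accesses miss in L1        */
        double miss_penalty = 20.0;  /* ns, to fetch from the next level */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f ns\n", amat);   /* 1 + 0.05 * 20 = 2.00 ns */
        return 0;
    }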

52 MGRJ,ECE,RVCE
Tutorial 1
 Perform capacity planning for a two level memory hierarchy system. The first level, M1 is
a cache with three capacity choices 64 Kbytes, 128 Kbytes and 256 Kbytes. The second
level, M2 is a main memory with a 4 Mbyte capacity. Let C1 and C2 be the cost per byte
and t1 and t2 the access times for M1 and M2 respectively. Assume C1=20C2 and t2=10t1.
The cache hit ratios for the three capacities are assumed to be 0.7, 0.9 and 0.98
respectively.
i) What is the average access time ta in terms of t1=20ns in the three cache designs?
ii) Express the average byte cost of the entire memory hierarchy if C2=$0.2/Kbyte.
iii) Compare the three memory designs and indicate the order of merit in terms of
average costs and average access times respectively.
Choose the optimal design based on the product of average cost & average access times.

53 MGRJ,ECE,RVCE
Tutorial 1
 Consider a three level memory hierarchy with following specifications:

Memory level Access Time Capacity Cost/Kbyte


Cache t1=25 ns s1=512 KB c1=$ 1.25
Main memory t2=unknown s2=32 MB c2=$0.2
Disk array t3= 4 ms s3=unknown c3=$0.0002

Design the memory hierarchy to achieve an effective memory access time t = 10.04 us
with a cache hit ratio h1 = 0.98 and a hit ratio h2 = 0.9 in the main memory. The
total cost of the memory hierarchy is upper bounded by $15,000.

54 MGRJ,ECE,RVCE
Cache
 Cache is a small, fast buffer (SRAM) between the processor and memory.
 Old values are removed from the cache to make space for new values; the cache works
on the principle of locality.
 CPU-Cache interaction:
➢ The transfer unit between the CPU register file and the cache is a 4-byte word;
the tiny, very fast CPU register file has room for four 4-byte words.
➢ The transfer unit between the cache and main memory is a 4-word block (16 B);
the small fast L1 cache has room for two 4-word blocks.
• The big slow main memory has room for many 4-word blocks.

55 MGRJ,ECE,RVCE
Source: NPTEL course on “Advanced Computer Architecture” by Dr. John Jose,IIT,Guhawati
Cache Organization
 Cache is an array of sets (S).
 Each set contains one or
more lines (E).
 Each line holds a block of
data (B bytes).
 Cache Size = S x E x B bytes.

56 MGRJ,ECE,RVCE
Addressing Caches

 The word at address A is in the cache if
the tag bits in one of the <valid> lines in
set <set index> match <tag>.
 The word contents begin at offset <block
offset> bytes from the beginning of the
block.
57 MGRJ,ECE,RVCE Source: NPTEL course on “Advanced Computer Architecture” by Dr. John Jose,IIT,Guhawati
Addressing Caches

 Locate the set based on <set index>.
 Locate the line in the set based on <tag>.
 Check that the line is valid.
 Locate the data in the line based on <block offset>.

58 MGRJ,ECE,RVCE
Tutorial 3
 A cache has 512 KB capacity, 4 B words, 64 B block size and is 8-way set associative. The
system uses 32-bit addresses. Given the address 0xABC89984, which set of the cache
will be searched, and which word of the selected cache block will be
forwarded if it is a hit in the cache?
# of sets = 512K / (8 x 64) = 1024 sets (set index = 10 bits).
Block size = 64 bytes = 16 words: 4 bits (word index) + 2 bits (byte offset) = 6 bits (block offset).
Addressing scheme (a C sketch of this decomposition is given below):
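A minimal C sketch of the address decomposition for this cache (6-bit block offset,
10-bit set index, remaining 16 bits as tag); variable names are illustrative:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t addr        = 0xABC89984u;
        uint32_t byte_offset = addr & 0x3u;          /* bits [1:0]   */
        uint32_t word_index  = (addr >> 2) & 0xFu;   /* bits [5:2]   */
        uint32_t set_index   = (addr >> 6) & 0x3FFu; /* bits [15:6]  */
        uint32_t tag         = addr >> 16;           /* bits [31:16] */

        printf("tag=0x%X set=%u word=%u byte=%u\n",
               tag, set_index, word_index, byte_offset);
        /* prints: tag=0xABC8 set=614 word=1 byte=0 */
        return 0;
    }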

59 MGRJ,ECE,RVCE
Suggested reading:
 Direct mapped cache/Set Associative cache

60 MGRJ, ECE
Virtual Memory
 Virtual memory is a memory management capability of an OS that uses hardware and
software to allow a computer to compensate for physical memory shortages by
temporarily transferring data from RAM to disk storage.
E.g. Consider a program with 8 pages (typically 4 KB or 8 KB each) residing in secondary
storage, with a physical memory of 16 pages, as shown below.
[Figure: program pages mapped between disk storage and physical memory]
 The MMU creates the illusion that there exists an infinitely large pool of memory in the
system, so a program whose size is larger than the physical memory can be executed.
61 MGRJ, ECE
Virtual Memory: How it works?
 The Memory Management Unit (MMU), under OS control, manages the transfer of data
between RAM and disk.
 The MMU translates (relocates) virtual addresses into physical addresses, as sketched below.
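A minimal C sketch of how a virtual address splits into a virtual page number and a page
offset, assuming the 4 KB page size mentioned above; the flat page_table array is a
hypothetical stand-in for the translation structure a real MMU walks in hardware:

    #include <stdint.h>

    #define PAGE_SIZE   4096u            /* 4 KB pages (assumed)        */
    #define PAGE_SHIFT  12u

    /* Hypothetical translation: returns the physical address. */
    uint32_t translate(uint32_t vaddr, const uint32_t *page_table)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;        /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1u);   /* offset within page  */
        uint32_t frame  = page_table[vpn];            /* physical frame no.  */
        return (frame << PAGE_SHIFT) | offset;
    }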

62 MGRJ, ECE
IO Subsystem
 Embedded systems are interfaced with different IOs to communicate with and control
the external world.
 Data transfer to and from the peripherals to CPU may be done in any of the three
possible ways:
 Busy wait IO or Programmed IO or Polling.
 Interrupt Driven IO.
 Direct Memory Access( DMA) based IO.
 Memory-mapped IO allows IO registers to be accessed as memory locations. As a
result, these registers can be accessed using only LOAD and STORE instructions
(see the sketch below).
 IO mapped IO allows IO to be accessed using separate IN and OUT instructions
provided by the ISA.
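A minimal C sketch of memory-mapped IO access through a volatile pointer; the register
address 0x40081000 is purely illustrative and not taken from any particular device:

    #include <stdint.h>

    /* Hypothetical peripheral data register mapped into the address space. */
    #define IO_DATA_REG  (*(volatile uint32_t *)0x40081000u)

    void io_write(uint32_t value)
    {
        IO_DATA_REG = value;        /* compiles to an ordinary STORE */
    }

    uint32_t io_read(void)
    {
        return IO_DATA_REG;         /* compiles to an ordinary LOAD  */
    }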

63 MGRJ, ECE
IO Subsystem….
Busy wait IO
 It requires constant monitoring of the peripheral devices by the CPU.
 A transfer from an IO device to memory requires the execution of several
instructions by the CPU, including an input instruction to transfer the data from the
device to the CPU and a store instruction to transfer the data from the CPU to
memory.
 The CPU stays in a program loop until the IO unit indicates that it is ready for
data transfer.
 Due to the time spent polling whether the IO device is ready, the processor often
cannot perform useful computation.

64 MGRJ, ECE
Busy wait IO..
Example: ADC 0809 interfaced to an 8051 MCU.
 EOC is connected to a GPIO pin configured as input.

Polling loop: Pseudo code

void main(void)
{
    /* MCU & ADC initialization */
    while (1)
    {
        /* Send Start of Conversion (SoC) to the ADC */

        /* Wait for End of Conversion (EoC) from the ADC */
        while (eoc == 0);

        /* Read the digital data on Port 0 */
    }
}

65 MGRJ, ECE
IO Subsystem….
Interrupt Driven IO
 By using the interrupt facility and a special interface that issues an interrupt request
signal whenever data is available from the IO device, the CPU can be interrupted for
processing.
 In the meantime, the CPU can proceed with any other program execution.
 Whenever it is determined that the IO device is ready for data transfer, the
interface (or the device itself) initiates an interrupt request signal to the computer.
 The CPU momentarily stops the task it was performing, branches to
the Interrupt Service Routine (ISR) to process the IO transfer, and then
returns to the task it was originally performing.

66 MGRJ, ECE
Interrupt Driven IO..
Example: ADC 0809 interfaced to an 8051 MCU.
 EOC is connected to an interrupt input of the MCU.
 Response time involves context switching overhead.

Interrupt Driven IO: Pseudo code

void main(void)
{
    /* MCU & ADC initialization */
    /* Interrupt initialization */
    while (1)
    {
        /* Send Start of Conversion (SoC) to the ADC */

        /* CPU is free to do any other operation here, */
        /* or can be kept in a sleep mode */
    }
}

void isr_ADC(void) interrupt 0
{
    /* Read the digital data */
}

67 MGRJ, ECE
DMA based IO
IO Subsystem….
 For fast, bulk data transfer between memory and IO devices, DMA is used.
 DMA eliminates CPU intervention in data transfers between IO devices and memory.
 During DMA the CPU is idle and has no control over the memory buses.
 The DMA controller takes over the buses to manage the transfer directly between
the I/O devices and the memory unit.
[Figure: bus connections when DMA is not operational vs. when DMA is operational]

68 MGRJ, ECE
DMA IO..
 Bus Request (HOLD): used by the DMA controller to request the CPU to
relinquish control of the buses.
 Bus Grant (HLDA): activated by the CPU to inform the external DMA controller
that the buses are in the high impedance state and the requesting DMA controller can
take control of the buses.
 Burst Transfer: a block sequence of memory words is transferred in
a continuous burst while the DMA controller is the master of the memory buses.
 Cycle Stealing: the DMA controller transfers one word at a time, after which it
must return control of the buses to the CPU. The CPU merely delays its operation for
one memory cycle to allow the direct memory IO transfer to “steal” one memory cycle.

69 MGRJ, ECE
Tutorial 2
Question
 Choose suitable method to support an IO device. Discuss the criterion to select the
method.
 Suggested Reading:
DMA unit of LPC 1857.

70 MGRJ, ECE
Tutorial 2
IO subsystem: Design IO Subsystem
 The LM75A is an industry-standard digital temperature sensor with an integrated
sigma-delta analog-to-digital converter (ADC) and I2C interface. The LM75A
provides 9-bit digital temperature with an accuracy of ±2°C from –25°C to 100°C
and ±3°C over –55°C to 125°C.

71 MGRJ, ECE
Source: LM75A data sheet
Hardware Accelerators & Coprocessors
 We can use hardware accelerators and coprocessors to create more efficient, higher
throughput designs.
 Hardware accelerators are dedicated fixed-function peripherals designed to
perform a single computationally intensive task over and over.
 They offload the main processor, with its general purpose instruction set,
allowing it to do general-purpose tasks.
 The application accelerator is not a new concept.
E.g. the 8087 Intel math coprocessor released in the 80’s.
 But it received renewed interest around 2002 due to the single thread
performance stall:
-Frequency scaling became unsustainable with smaller IC feature sizes.
-Instruction-level parallelism (ILP) can go only so far.
72 MGRJ, ECE
Hardware Accelerators..
Analog Devices SHARC® ADSP-2146x
 SHARC® ADSP-2146x processor incorporates hardware accelerators for
implementing three widely used signal processing operations: FIR (finite impulse
response), IIR (infinite impulse response), and FFT (fast Fourier transform).

 The ADSP-2146x core has a maximum clock rate of 450 MHz. By using SIMD (single-
instruction multiple-data), the core can perform two MAC (multiply-accumulate)
operations per clock cycle for a peak rate of 900 MMAC/sec.

 The accelerator in comparison, operates at the clock rate of 225 MHz. Using its four
dedicated MAC units, the FIR accelerator achieves a peak theoretical throughput of
900 MMAC/sec.

Source:White paper on hardware accelerators in SHARC processors by Paul Beckmann, DSP Concepts, LLC.
74 MGRJ, ECE
Analog Devices SHARC® ADSP-2146x

 Consider a home theatre system with 7.1 channels of audio at 96 kHz operating at a
block size of 32 samples. Assume that room equalization is being applied by 8 FIR
filters, each 512 points long.
 No. of MAC operations: 8 x 512 x 96 kHz ≈ 393 MMAC/sec.
 If the core CPU were to perform the filtering, it would take 44% of a 450 MHz
SHARC processor.
 This FIR processing represents a significant portion of the overall computation of
CPU and fortunately can be offloaded to the accelerator.

75 MGRJ, ECE
ARM NEON Hardware Accelerators..
 Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture
extension for the Arm Cortex-A series and Cortex-R52 processors.
 NEON technology is intended to improve the multimedia user experience by accelerating
audio and video encoding/decoding, user interfaces, 2D/3D graphics and gaming.
 NEON instructions operate on multiple data lanes packed into wide registers, so several
8-, 16- or 32-bit elements are processed by a single instruction.
• NEON can be used in multiple ways, including NEON enabled
libraries, the compiler's auto-vectorization feature, NEON intrinsics,
and NEON assembly code. A small intrinsics sketch is given below.
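A minimal sketch using NEON intrinsics (one of the usage options listed above), assuming an
Arm toolchain that provides arm_neon.h; it performs four float additions per instruction:

    #include <arm_neon.h>

    /* Adds n floats elementwise; n assumed to be a multiple of 4 for brevity. */
    void add_vectors(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(&a[i]);   /* load 4 floats        */
            float32x4_t vb = vld1q_f32(&b[i]);
            float32x4_t vc = vaddq_f32(va, vb);  /* 4 additions at once  */
            vst1q_f32(&c[i], vc);                /* store 4 results      */
        }
    }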

76 MGRJ, ECE Source: www.arm.com


Tutorial 2
 A new multimedia unit (MU) that is added in a processor speeds up the
completion of multimedia instructions given to the processor by 4 times.
Assuming a program has 40% multimedia instructions, what is the overall
speedup gained while running the program when it is executed on the
processor with the new MU than when it is run on the processor without this
MU?( Use Amdahl’s Law)
 Overall Speedup = 1 / [(1 − F) + F/S]
where F = fraction enhanced and S = speedup of the enhanced fraction.
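A minimal C sketch applying Amdahl's Law with the values given in the problem (F = 0.4,
S = 4, treating the instruction fraction as the fraction of execution time, as the question
directs):

    #include <stdio.h>

    int main(void)
    {
        double F = 0.4;   /* fraction enhanced (multimedia instructions) */
        double S = 4.0;   /* speedup of the enhanced fraction (new MU)   */

        double overall = 1.0 / ((1.0 - F) + F / S);
        printf("Overall speedup = %.3f\n", overall);   /* 1/(0.6 + 0.1) = 1.429 */
        return 0;
    }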

77 MGRJ, ECE
Enhancing Performance of Processors
Pipelining
 A pipeline is a cascaded connection of processing stages which are connected to
perform a fixed function over a stream of data flowing from one end to the other.
 In modern CPUs, the pipelines are applied for instruction execution, arithmetic
computation and memory access operations.
 The pipeline is constructed with k processing stages. The processed results are passed
from stage Si to stage Si+1 for all i = 1, 2, …, k − 1.
Si = stage i
L = latch
τ = clock period
τm = max stage delay
d = latch delay

78 MGRJ, ECE
Source: Kai Hwang, “Advanced Computer Architecture”, Tata Mcgraw Hill Education.
Pipelining…
Clock Cycle (τ):
τ = max{τi , i = 1, 2, …, k} + d = τm + d
Pipeline Frequency or Maximum Throughput:
f = 1/τ
 Ideally, one result is expected to come out of the pipeline per cycle.
 However, depending on the initiation rate of successive tasks, the actual throughput of the
pipeline will be lower than f.
Speedup (Sk):
 Ideally, a pipeline with k stages can process n tasks in k + n − 1 clock cycles, where k
cycles are needed to complete the execution of the very first task and the remaining (n − 1)
tasks require (n − 1) cycles.
 Total time required: Tk = [k + (n − 1)]τ

79 MGRJ, ECE
Speedup….
 Consider an equivalent nonpipelined processor which has a flow-through delay
of kτ.
 Total time required: Tl = nkτ.
 Speedup factor: Sk = Tl / Tk = nkτ / ([k + (n − 1)]τ) = nk / (k + n − 1)

 The maximum speedup is Sk → k as n → ∞. The maximum speedup is very difficult to achieve
because of data dependences between successive tasks, program branches, interrupts, etc.
 For small values of n, the speedup is very poor,
as shown in the figure.
 The larger the number of stages k, the higher the potential
speedup.

80
MGRJ, ECE
Speedup….
 However, the number of stages cannot be increased indefinitely because of practical
constraints on cost, implementation complexity, circuit implementation, etc.
 The figure shows the optimal number of pipeline stages (performance/cost ratio vs.
number of stages).

 In practice, most pipelines are staged with 2 ≤ k ≤ 15. Very few pipelines are
designed to exceed 10 stages in real computers.

81 MGRJ, ECE
Speedup….
 The efficiency Ek is defined as:
Ek = Sk / k = n / (k + n − 1)
Ek → 1 as n → ∞ (upper bound)
Ek → 1/k when n = 1 (lower bound)
 The pipeline throughput Hk is defined as the number of tasks performed per unit
time:
Hk = n / ([k + (n − 1)]τ) = nf / (k + n − 1) = Ek · f

Maximum throughput f occurs when Ek → 1 as n → ∞.
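A minimal C sketch evaluating these formulas for illustrative values (k = 5 stages, n = 100
tasks, τ = 10 ns are assumptions for this example only):

    #include <stdio.h>

    int main(void)
    {
        int    k   = 5;        /* pipeline stages       */
        int    n   = 100;      /* number of tasks       */
        double tau = 10e-9;    /* clock period, seconds */

        double f  = 1.0 / tau;                       /* maximum throughput     */
        double Sk = (double)(n * k) / (k + n - 1);   /* speedup                */
        double Ek = Sk / k;                          /* efficiency             */
        double Hk = Ek * f;                          /* throughput, tasks/sec  */

        printf("Sk = %.3f, Ek = %.3f, Hk = %.3e tasks/s\n", Sk, Ek, Hk);
        /* prints roughly Sk = 4.808, Ek = 0.962, Hk = 9.615e7 tasks/s */
        return 0;
    }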

82 MGRJ, ECE
Instruction Pipeline
 ARM Cortex M3 three stage pipeline:
- The figure below shows the 3 stage pipeline of the Cortex M3.
- The Fetch stage fetches instructions from memory, presumably one per cycle.
- The Decode stage reveals the instruction function to be performed and
identifies the resources needed.
- The instructions are executed in the Execute stage.

[Figure: Fetch/Decode/Execute stages overlapped across cycles 1, 2, 3, 4, 5, …]

83 MGRJ, ECE
Instruction Pipeline….
 A seven stage instruction pipeline
- The figure below shows a seven stage pipeline with three Execute (E) stages.
- The Issue (I) stage reserves resources and controls pipeline interlocks.
- The Writeback (W) stage is used to write results back into the registers.

- Assume pipelined execution of the high level language statements:
X = Y + Z and A = B x C
- These macro operations will be converted into several assembly statements.

84 MGRJ, ECE
Seven stage Instruction Pipeline….

 Assuming this architecture, the pipelined execution is shown below.


 The figure illustrates the issue of instructions following the original program order (in-
order execution). The shaded boxes correspond to idle cycles when instruction
issues are blocked due to resource latency or data dependencies.

85 MGRJ, ECE
Seven stage Instruction Pipeline….

 The following figure shows the improved timing after the instruction issue order
is changed (out-of-order execution) to eliminate unnecessary delays due to
dependences.

86 MGRJ, ECE
Enhancing Performance of Processors
Superscalar Execution
 In superscalar execution, multiple instruction pipelines are used. This implies multiple
instructions are issued per cycle and multiple results are generated per cycle.
 Superscalar processors are designed to exploit more instruction level parallelism in
user programs.
 Only independent instructions can be executed in parallel without causing a wait state.
 The amount of instruction level parallelism varies widely depending on the type of code
being executed.
 The instruction issue degree in a superscalar processor has been limited to 2 to 5 in
practice (the average number of instructions that can be executed in parallel is about 2
without loop unrolling).

87 MGRJ, ECE
Superscalar Execution….

 The figure shows a triple issue superscalar processor with degree m = 3.


 Due to the desire for a higher degree of instruction level parallelism in programs, the
superscalar processor depends more on the compiler to exploit parallelism.
E.g: IBM’s PowerPC

Time in cycles

88 MGRJ, ECE
Enhancing Performance of Processors
VLIW architecture
• The Very Long Instruction Word (VLIW) architecture uses multiple
functional units.
• The CPI of a VLIW processor is lower compared to CISC & RISC processors.
• Instruction words are 256 or 1024 bits long.
• Programs are written in conventional short instruction words;
the code compaction must be done by the compiler.
• Instruction parallelism and data movement in a VLIW architecture are
completely specified at compile time.

89 MGRJ, ECE
VLIW architecture…

 A typical VLIW processor with instruction format in shown in figure.

90 MGRJ, ECE
VLIW architecture…

 A typical VLIW execution with degree 3 is shown in figure.

91 MGRJ, ECE
Enhancing Performance of Processors
Multi core CPUs
 Power and frequency limitations observed in single core implementations have
paved the way for multicore technology.
 The frequency of single core CPUs is limited to around 4 GHz, as any increase beyond
this frequency sharply increases power dissipation.
 A multi-core processor is typically a single processor package which contains several
cores on a chip.
 The individual cores on a multi-core processor don’t necessarily run as fast as the
highest performing single-core processors, but they improve overall performance by
handling more tasks in parallel.

92 MGRJ, ECE
Multi core CPUs..

 The multiple cores inside the chip are not clocked at a higher frequency; instead, their
capability to execute programs in parallel is what ultimately contributes to the overall
performance, making them more energy efficient, low power cores, as shown in the
figure below.
 Multi-core processors can be implemented with homogeneous cores, heterogeneous
cores, or a combination of both.
 In a homogeneous core architecture, all the cores in the CPU are identical and they
improve the overall processor performance by breaking up a computationally intensive
application into less computationally intensive parts and executing them in parallel.
E.g: AMD dual cores & Intel Core 2 Duo and Quad cores.
93 MGRJ, ECE
Source: Intel Higher Education Program & FAER.
Multi core CPUs..

 Heterogeneous multicores consist of dedicated application specific processor
cores that target the issue of running a variety of applications on a computer.
 An example could be a DSP core addressing multimedia applications that require
heavy mathematical calculations, a complex core addressing computationally
intensive applications and a remedial core which addresses less computationally
intensive applications.
E.g: TI OMAP (ARM core + DSP core), Qualcomm Snapdragon
Challenges with multicores:
 The majority of applications used today were written to run on only a single processor,
failing to use the capability of multi-core processors.

94 MGRJ, ECE
Challenges with multicores:

 The delay of on-chip interconnects is becoming a critical bottleneck in meeting the
performance of multi-core chips. The performance of the processor truly
depends on how fast a CPU can fetch data rather than how fast it can operate on
it, in order to avoid data starvation scenarios.
 Multiple cores accessing shared data simultaneously may lead to a timing
dependent error known as a “data race condition”. In a multi-core environment, a data
structure is open to access by all other cores while one core is updating it. If a
secondary core accesses the data even before the first core finishes updating the
memory, the secondary core faults in some manner.
 Interaction between on-chip components (viz. cores and memory controllers) and
shared components (viz. caches and memories), where bus contention and
latency are the key areas of concern.

95 MGRJ, ECE
CPU Benchmarking standards
 MIPS (Million Instructions Per Second)
MIPS = Clock Frequency / (CPI x 1,000,000)
 MIPS is only an approximation of a processor's performance because some
instructions do more work than others.
 A computer rated at 100 MIPS may be able to compute certain values faster than
another computer rated at 120 MIPS.
 Dhrystone MIPS: Dhrystone is a standard program consisting of arithmetic &
logical operations on integers and is used to benchmark CPUs.

98 MGRJ, ECE
MIPS…
Tutorial 4
 The execution times (in seconds) of three programs on three MCUs are given
below:

Execution Time (in seconds):

Program        MCU A    MCU B    MCU C
Program 1         10        1       20
Program 2       1000      200       20
Program 3        500      200       20

 Assume that 100,000,000 instructions were executed in each of the three
programs. Calculate the MIPS (Million Instructions Per Second) rating of each
program on each of the three machines. Based on these ratings, draw a clear
conclusion regarding the relative performance of the three computers. (A C sketch of
the MIPS computation is given below.)
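A minimal C sketch of the MIPS computation for this table, using
MIPS = instruction count / (execution time x 10^6):

    #include <stdio.h>

    int main(void)
    {
        double instr = 100000000.0;                  /* 100 M instructions per program */
        double t[3][3] = { {  10,   1, 20 },         /* execution times in seconds:    */
                           {1000, 200, 20 },         /* rows = programs,               */
                           { 500, 200, 20 } };       /* columns = MCU A, B, C          */

        for (int p = 0; p < 3; p++)
            for (int m = 0; m < 3; m++)
                printf("Program %d on MCU %c: %.2f MIPS\n",
                       p + 1, 'A' + m, instr / (t[p][m] * 1e6));
        return 0;
    }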

99 MGRJ, ECE
CPU Benchmarking standards
MFLOPS (Mega Floating Point Operations per Second)

 A floating-point operation is an addition, subtraction, multiplication, or division
operation applied to numbers in single or double precision floating point
representation.
 Clearly, a MFLOPS rating is dependent on the program. Different programs require
the execution of different numbers of floating-point operations.
 MFLOPS has a stronger claim than MIPS to being a fair comparison between
different computers. The key to this claim is that the same program running on
different computers may execute a different number of instructions but will always
execute the same number of floating-point operations.

100 MGRJ, ECE


CPU Benchmarking standards: CoreMark

 CoreMark® is a benchmark that measures the performance of microcontrollers (MCUs)
and central processing units (CPUs) used in embedded systems.
 Replacing the Dhrystone benchmark, CoreMark contains implementations of the
following algorithms: list processing (find and sort), matrix manipulation, state machine
(determine if an input stream contains valid numbers), and CRC (cyclic redundancy
check).
 It is designed to run on devices from 8-bit microcontrollers to 64-bit microprocessors.

101 MGRJ, ECE


 Suggested reading:
MMAC
CoreMark: https://www.eembc.org/coremark/
SPECint2006, SPECfp2006

102 MGRJ, ECE
