High Level Synthesis With Catapultc: Michal Stala


High Level Synthesis with CatapultC

MICHAL STALA
Motivation
What you learn in the DSP-course is actually used in the
real world
Introduction to Lab 2/3
Topics
Introduction to Catapult
High Level Synthesis (HLS) advantages
Catapult Demo
Introduction: normal design flow
Design (VHDL) -> Synthesis -> Netlist (VHDL)
Introduction: Catapult design flow
C++/SystemC code + Constraints -> Catapult -> Intermediate netlist (VHDL) -> Synthesis -> Netlist (VHDL)
Introduction Catapult
Writing software (C++) code for hardware design
Successful projects require HW engineers, not SW engineers
It is important to know HW concepts to get optimal results
The following concepts from the DSP course are used in the Catapult SW:
Folding/Unfolding
Re-timing/Pipelining
Bit-level optimization
and more
HLS advantages
Verification/debug

QuestaSim vs. SW debugger

Both can be used when developing in CatapultC


Most functionality can be debugged in SW
Speedup of ~1000x
Reduced design time - behavioral model of RAM
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
--------------------------------------------------------------
entity SRAM is
generic( width:      integer := 4;
         depth:      integer := 4;
         addr:       integer := 2);
port(    Clock:      in std_logic;
         Enable:     in std_logic;
         Read:       in std_logic;
         Write:      in std_logic;
         Read_Addr:  in std_logic_vector(addr-1 downto 0);
         Write_Addr: in std_logic_vector(addr-1 downto 0);
         Data_in:    in std_logic_vector(width-1 downto 0);
         Data_out:   out std_logic_vector(width-1 downto 0)
);
end SRAM;
--------------------------------------------------------------

architecture behav of SRAM is

-- use an array to define the internal temporary signals

type ram_type is array (0 to depth-1) of
    std_logic_vector(width-1 downto 0);
signal tmp_ram: ram_type;
begin
-- Read Functional Section
process(Clock, Read)
begin
    if (Clock'event and Clock='1') then
        if Enable='1' then
            if Read='1' then
                -- built-in function conv_integer changes the type
                -- from std_logic_vector to integer
                Data_out <= tmp_ram(conv_integer(Read_Addr));
            else
                Data_out <= (Data_out'range => 'Z');
            end if;
        end if;
    end if;
end process;

-- Write Functional Section
process(Clock, Write)
begin
    if (Clock'event and Clock='1') then
        if Enable='1' then
            if Write='1' then
                tmp_ram(conv_integer(Write_Addr)) <= Data_in;
            end if;
        end if;
    end if;
end process;
end behav;
Reduced design time - TB
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity MEM_TB is            -- entity declaration
end MEM_TB;

--------------------------------------------------------------------

architecture TB of MEM_TB is

component SRAM is
port(    Clock:      in std_logic;
         Enable:     in std_logic;
         Read:       in std_logic;
         Write:      in std_logic;
         Read_Addr:  in std_logic_vector(1 downto 0);
         Write_Addr: in std_logic_vector(1 downto 0);
         Data_in:    in std_logic_vector(3 downto 0);
         Data_out:   out std_logic_vector(3 downto 0)
);
end component;

signal T_Clock, T_Enable, T_Read, T_Write: std_logic;
signal T_Data_in, T_Data_out: std_logic_vector(3 downto 0);
signal T_Read_Addr: std_logic_vector(1 downto 0);
signal T_Write_Addr: std_logic_vector(1 downto 0);

begin

U_CKT: SRAM port map (T_Clock, T_Enable, T_Read, T_Write,
        T_Read_Addr, T_Write_Addr, T_Data_in, T_Data_out);

Clk_sig: process
begin
    T_Clock<='1';           -- clock cycle 10 ns
    wait for 5 ns;
    T_Clock<='0';
    wait for 5 ns;
end process;

process
variable err_cnt: integer := 0;
begin

    T_Enable <= '1';
    T_Read <= '0';
    T_Write <= '0';
    T_Write_Addr <= (T_Write_Addr'range => '0');
    T_Read_Addr <= (T_Read_Addr'range => '0');
    T_Data_in <= (T_Data_in'range => '0');
    wait for 20 ns;

    -- test write
    for i in 0 to 3 loop
        T_Write_Addr <= T_Write_Addr + '1';
        T_Data_in <= T_Data_in + "10";
Reduced design time - C equivalent

static int mem[512];        // Only one row to define
void mem_write(int &value, int &index){
    mem[index] = value;     // Only one row for write
}

int main(){
    int value = 123;        // declared here; the original slide omitted the types
    int index = 0;
    mem_write(value, index);
}
Low-level design also possible with SystemC
#include <systemc.h>

SC_MODULE(sc_method_comb_example){
    sc_in<bool > clk;
    sc_in<bool > rst;
    sc_in<sc_int<8> > a;
    sc_in<sc_int<8> > b;
    sc_out<sc_int<9> > dout;

    SC_CTOR(sc_method_comb_example):clk("clk"),
        rst("rst"),a("a"),b("b"),dout("dout")
    {
        SC_METHOD(exec);
        sensitive << a << b;    // combinational: re-evaluate when a or b changes
    }
    void exec(){
        sum = a.read() + b.read();
        dout.write(sum);
    }
private:
    sc_int<9> sum;              // one bit wider than the inputs
};
Other advantages
Low- and high-level design: optimize where needed
Reduce time for design changes
Efficient SW/HW co-simulation
The same environment as SW, for example gcc
Same code, different environments:

HLS synthesis tool

HLS model SW verification

HW/SW co-simulation
VHDL design restricted to requirements
Frequency requirement
System clock is 100 MHz and all blocks should use this frequency
ASIC target lib
Target is for example 28nm CMOS

Problem when a HW block is re-used in future projects, since the VHDL code is optimized for other requirements!
Catapult constraints
One C++/SystemC codebase but different constraints on:
Frequency
ASIC target lib
And other

C++ code can be re-used in future projects when requirements change!
Catapult advantages summary
Easier and faster verification
Reduce design time
Reduce time for design changes
Efficient SW/HW co-simulation
Easy design exploration
Easy optimization exploration
ASIC technology adaptive
Target frequency adaptive
Easy to maintain, easy to reuse
Object Oriented Language
Catapult demo introduction
Demo 1: Pipeline
Demo 2: Unfolding

Both demos are small, simple examples to demonstrate the possibilities with Catapult.

Think big!

What effort would it take to pipeline or unfold huge designs with billions of gates?
Catapult pipeline
[Figure: x_in and y_in feed a multiplier (X); the product and z_in feed an adder (+) that drives out.]
void f(int &x_in, int &y_in, int &z_in, int &out){
    out = x_in*y_in+z_in;
}
Catapult pipeline
[Figure: the same multiply-add datapath as above.]
Where do we put the cutset?
Catapult pipeline
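[Figure: the same datapath with a cutset through z_in and the multiplier output; a register (D) is placed on each cut edge, in front of the adder.]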
Time for demo!
Imagine a huge design
It is difficult to find the optimal pipeline solution (the largest ASICs used 5B gates in 2013)
All the cutset possibilities are difficult to analyze
All optimizations are wasted when moving to another tech node or frequency
Catapult unfolding
Unfolding or parallel processing
Speed up the computation by a factor J
Area penalty
Parallel 3-Tap

$y(n) = b_0 x(n) + b_1 x(n-1) + b_2 x(n-2)$

$y(n+1) = b_0 x(n+1) + b_1 x(n) + b_2 x(n-1)$

Substituting n = 2k:

$y(2k) = b_0 x(2k) + b_1 x(2k-1) + b_2 x(2k-2)$

$y(2k+1) = b_0 x(2k+1) + b_1 x(2k) + b_2 x(2k-1)$


Parallel 3-Tap (2)
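$y(2k) = b_0 x(2k) + b_1 x(2k-1) + b_2 x(2k-2)$

$y(2k+1) = b_0 x(2k+1) + b_1 x(2k) + b_2 x(2k-1)$

[Figure: block diagram starting at k=0. Inputs x(2k+1) and x(2k); two delay elements (D) provide x(2k-1) and x(2k-2); one set of multipliers b0, b1, b2 produces y(2k) and a second set produces y(2k+1).]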
Parallel 3-Tap (3)
For k = 0:

$y(0) = b_0 x(0) + b_1 x(-1) + b_2 x(-2)$

$y(1) = b_0 x(1) + b_1 x(0) + b_2 x(-1)$

[Figure: the same block diagram with 2 inputs per cycle. The input pairs x(1) x(0), x(3) x(2), x(5) x(4) enter in turn; the y(2k) branch produces y(0), y(2), y(4) and the y(2k+1) branch produces y(1), y(3), y(5).]
Time for Catapult magic!
