Programming Vision Applications on Zynq Using OpenCV and High-Level Synthesis


Kees Vissers

MPSoC, July 2013


Video and Vision Processing
Consumer Video
Displays
Medical Display
Machine Vision
A&D UAV
Studio / Cinema
Camera
Office-class MFP
Digital Signage
Video Conferencing
Driver Assist
HD Surveillance
Broadcast
Reference Monitor
From pixels to information

Required Pixel Rate Processing vs. Capabilities
[Figure: pixel rate by video format over time (1995, 2000, 2005, 2013), compared with the capabilities of programmable solutions: RISC, GPU, FPGA.]
At 500 ops per pixel:
20 Mpix/s requires 10 Gops
120 Mpix/s requires 60 Gops
480 Mpix/s requires 240 Gops
High-resolution display pixel-rate processing exceeds RISC-based capabilities.
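The three figure points follow from multiplying the pixel rate by the 500 ops per pixel quoted on the slide. A minimal sketch of that arithmetic in C; the function names are illustrative and not from the talk:

```c
#include <assert.h>
#include <stdio.h>

/* Required compute in Gops for a pixel rate given in Mpix/s.
   Mpix/s * ops/pixel gives Mops; dividing by 1000 gives Gops. */
double required_gops(double mpix_per_s, double ops_per_pixel)
{
    assert(mpix_per_s >= 0.0 && ops_per_pixel >= 0.0);
    return mpix_per_s * ops_per_pixel / 1000.0;
}

/* Print the three pixel rates from the slide at 500 ops per pixel. */
void print_pixel_rate_table(void)
{
    const double rates_mpix[] = { 20.0, 120.0, 480.0 };
    for (int i = 0; i < 3; ++i) {
        printf("%4.0f Mpix/s -> %4.0f Gops\n",
               rates_mpix[i], required_gops(rates_mpix[i], 500.0));
    }
}
```

Calling print_pixel_rate_table() from a main() reproduces the 10, 60, and 240 Gops values.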

Processors and Pipelines
clock:sample   Data rate (200 MHz clock)   Design approach
1000:1         200 Ks/s                    RISC processor
100:1          2 Ms/s                      Processor with accelerators
10:1           20 Ms/s                     Folded datapath
1:1            200 Ms/s                    Pipelined datapath
1:10           2 Gs/s                      Replicated datapath
Applications (low to high data rate): control, audio, mobile video, HDTV, comms, networking
Ranges marked on the figure: Processors, HLS tools, FPGAs
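The data-rate axis is the 200 MHz clock divided by the clock:sample ratio. A small sketch of that relation; the function and parameter names are assumptions made for illustration:

```c
#include <assert.h>

/* Sustained sample rate for a design that consumes `samples` samples
   every `clocks` clock cycles. A ratio of 1000:1 is clocks = 1000,
   samples = 1; a replicated datapath at 1:10 is clocks = 1, samples = 10. */
double data_rate_sps(double clock_hz, double clocks, double samples)
{
    assert(clock_hz > 0.0 && clocks > 0.0 && samples > 0.0);
    return clock_hz * samples / clocks;
}
```

For example, data_rate_sps(200e6, 1000, 1) gives 200 Ks/s for the RISC end of the chart, and data_rate_sps(200e6, 1, 10) gives 2 Gs/s for the replicated datapath.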

Zynq Products in Context for Video
[Figure: the clock:sample chart (1000:1 to 1:10, 200 Ks/s to 2 Gs/s at a 200 MHz clock; RISC processor, processor with accelerators, folded datapath, pipelined datapath, replicated datapath), annotated for Zynq.]
Applications: frame rate processing, line rate processing, HDTV pixel rate, ...
ARM A9 processors: 1-2 Gops
Fabric: 10-500 Gops

Zynq Platform
[Block diagram of the Zynq device]
Processing System: two ARM Cortex-A9 MPCore cores, each with a NEON/FPU engine and 32/32 KB I/D caches; Snoop Control Unit (SCU); 512 KB L2 cache; 256 KB on-chip memory; timer counters; General Interrupt Controller; DMA; configuration; CoreSight multi-core and trace debug
Memory interfaces: static memory controller (Quad-SPI, NAND, NOR); dynamic memory controller (DDR3, DDR2, LPDDR2)
Peripherals (through the I/O MUX, MIO): 2x GigE with DMA, 2x USB with DMA, 2x SDIO with DMA, 2x SPI, 2x I2C, 2x CAN, 2x UART, GPIO
Interconnect: AMBA switches
Programmable Logic: system gates, DSP, RAM; XADC; PCIe; multi-standard I/Os (3.3 V and high-speed 1.8 V); multi-gigabit transceivers

Video Board: Zynq 702 board

Programming
Image processing is often programmed using streaming and OpenCV libraries
The processor runs Linux and a window system (Qt)
On-chip performance monitors are tied to the running display, showing processor load and FPGA-to-external-memory load
Basic idea:
take the edge detect of the current frame and the edge detect of the previous frame
subtract: if they are the same, show nothing; if they differ, show a fat pixel
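The subtract step can be sketched on two binary edge maps. The deck calls a helper named detect_new_edges() but does not show its body, so the function below is a hypothetical stand-in: it keeps only pixels that are edges in the current frame and not in the previous one, which is what a saturating subtraction of the two binary maps would produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for detect_new_edges(): the deck does not show
   the real implementation. Inputs are binary edge maps (0 or 255), one
   byte per pixel. A pixel is marked 255 in new_edge only when it is an
   edge in the current frame and was not an edge in the previous frame. */
void detect_new_edges_sketch(const uint8_t *edge_prev,
                             const uint8_t *edge_current,
                             uint8_t *new_edge,
                             size_t num_pixels)
{
    assert(edge_prev != NULL && edge_current != NULL && new_edge != NULL);
    for (size_t i = 0; i < num_pixels; ++i) {
        new_edge[i] = (edge_current[i] != 0 && edge_prev[i] == 0) ? 255 : 0;
    }
}
```

Unchanged edges and unchanged background both map to 0, so a static scene yields an empty result and only newly appearing edges survive.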

Source Code Mapping
Application code:
while (1) {
    // Get the image frame from the video input
    frame_current = cvQueryFrame(capture);
    // Convert to grayscale (this step is not shown on the slide; assumed here)
    cvCvtColor(frame_current, gray_current, CV_BGR2GRAY);
    // Detect edges in the current frame
    cvSobel(gray_current, edge_current2, 1, 0, aperature_size);
    cvSobel(gray_current, edge_current1, 0, 1, aperature_size);
    cvAdd(edge_current2, edge_current1, edge_current2, NULL);
    cvConvertScale(edge_current2, edge_current, scale, 0);
    cvThreshold(edge_current, edge_current, 5, 255, CV_THRESH_BINARY);
    // Convert the previous frame to grayscale (also assumed; not on the slide)
    cvCvtColor(frame_prev, gray_prev, CV_BGR2GRAY);
    // Detect edges in the previous frame
    cvSobel(gray_prev, edge_prev2, 1, 0, aperature_size);
    cvSobel(gray_prev, edge_prev1, 0, 1, aperature_size);
    cvAdd(edge_prev2, edge_prev1, edge_prev2, NULL);
    cvConvertScale(edge_prev2, edge_prev, scale, 0);
    cvThreshold(edge_prev, edge_prev, 5, 255, CV_THRESH_BINARY);
    // Detect edges that are only present in the edge_current image
    detect_new_edges(edge_prev, edge_current, new_edge);
    // Remove noise from the new edges
    cvSmooth(new_edge, filtered_new_edges, CV_MEDIAN, 7, 7);
    // Combine new edges with the current frame, highlighting new edges in red
    highlight_blend(frame_current, filtered_new_edges, output_frame);
    // Copy the current frame into the previous frame
    cvCopy(frame_current, frame_prev, NULL);
    // Display the output frame
    cvShowImage("Detector Output", output_frame);
}
Application on Zynq:
[Diagram: video input subsystem, ARM A9 processor, video output subsystem, and memory subsystem; legend: SW, FPGA fabric.]

Source Code Mapping
[Diagram: the same application code mapped onto Zynq with HLS-generated blocks: Sobel (current edge), Sobel (previous edge), detect_new_edge, cvSmooth, highlight_blend. Also shown: video input subsystem, ARM A9 processor, video output subsystem, memory subsystem.]

OpenCV Motion Detection Algorithm
[Figure: current frame, previous frame, and output frame with the detected new car highlighted.]

External Input/Output and Compute
[Block diagram. Blocks: PS7 (ports HP0, HP1, HP2), HDMI input and HDMI output (each with Hsync, Vsync, Clk, de), hdmi2axi, axivdma, axidma (32b), 2x axidma processing, sync, YCbCr2RGB, logicvc.]
The frame buffer is the application-level abstraction for HDMI input/output:
HDMI to framebuffer: IO IP subsystem
Framebuffer to HDMI: IO IP subsystem
Processing functions:
sobel_filter_pass();
sobel_filter();
diff_image();
median_char_filter_pass();
combo_image();
ycbcr2rgb_pad();
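The deck lists sobel_filter() among the fabric functions without showing its body. As a rough illustration of the kind of per-pixel work involved, here is a plain C 3x3 Sobel edge map over a frame buffer. This is a hypothetical sketch, not the demo's code: it uses |gx| + |gy| followed by a threshold, a common simplification, whereas the OpenCV listing adds the two signed gradients before scaling and thresholding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of a 3x3 Sobel edge detector on an 8-bit
   grayscale frame stored row by row. A pixel is set to 255 when
   |gx| + |gy| exceeds the threshold, otherwise 0. Border pixels,
   which lack a full 3x3 neighbourhood, are set to 0. */
void sobel_edges_sketch(const uint8_t *gray, uint8_t *edges,
                        int width, int height, int threshold)
{
    assert(gray != NULL && edges != NULL && width >= 3 && height >= 3);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t out = 0;
            if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
                const uint8_t *r0 = gray + (y - 1) * width + x; /* row above */
                const uint8_t *r1 = gray + y * width + x;       /* this row  */
                const uint8_t *r2 = gray + (y + 1) * width + x; /* row below */
                /* Horizontal gradient: right column minus left column. */
                int gx = (r0[1] + 2 * r1[1] + r2[1])
                       - (r0[-1] + 2 * r1[-1] + r2[-1]);
                /* Vertical gradient: bottom row minus top row. */
                int gy = (r2[-1] + 2 * r2[0] + r2[1])
                       - (r0[-1] + 2 * r0[0] + r0[1]);
                out = (abs(gx) + abs(gy) > threshold) ? 255 : 0;
            }
            edges[y * width + x] = out;
        }
    }
}
```

A version intended for HLS would typically be restructured to read pixels as a stream and keep the two previous rows in line buffers, so that each pixel is fetched from memory only once.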

Laptop demo with webcam
Exactly the same OpenCV image processing pipeline
Uses OpenCV libraries optimized for Intel SSE vector processing
Webcam is 1280 x 720 (720p)
Runs on an Intel i7 processor at ~2.7 GHz with 8 GB DRAM
Net result: in the range of one frame every 1-2 seconds
Demo

Complete setup:
Model train
HDTV camera: 1080p, 60 frames per second, HDMI
Board with the application running: Linux, Qt, processor and bus load display
Mouse tied to a register to set the threshold value dynamically
HDTV: 1080p, HDMI

Power zones
[Block diagram: the Zynq platform block diagram annotated with the power zones INT, BRAM, AUX, ADJ, PINT, MIO, 1V5, PAUX.]

Power zones
Zones: INT, PINT, AUX, PAUX, ADJ, BRAM, MIO, 1V5 (= DDR)
Supply voltages shown in the figure: 3V3, 2V5, 1V5

Power Measurement Output on Running Board
Power in watts
Only a few measurements per second
10% noise
10% difference between bit streams

Performance and Power Measurements
All pixel processing with OpenCV libraries running on the ARM A9: one frame per 13 seconds
All pixel processing with C++ running on the A9: one frame per 1-2 seconds
All pixel processing with C++ libraries implemented via HLS in the FPGA: 60 frames per second, with the FPGA running at 130 MHz
A9 processors: 500 mW to 800 mW
FPGA fabric fully running: 500 mW to 1 W
On-chip I/O: a few hundred mW; on-board DRAM: 800 mW
Result: ~100x speedup at the SAME power consumption, ~100 Gops
Energy efficiency is in the 100-200 Gops/W range for the FPGA in the complete system!
You can put your finger on the running chip in the system: warm but not HOT, less than ~2 W!
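The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only figures quoted on the slides; the function names are illustrative. Which baseline the ~100x refers to is not stated: 60 frames per second against 1-2 seconds per frame gives 60x to 120x, and against 13 seconds per frame gives 780x.

```c
#include <assert.h>

/* Speedup of a pipeline delivering `fps` frames per second over a
   baseline that needs `baseline_s_per_frame` seconds for one frame. */
double speedup(double fps, double baseline_s_per_frame)
{
    assert(fps > 0.0 && baseline_s_per_frame > 0.0);
    return fps * baseline_s_per_frame;
}

/* Energy efficiency in Gops/W. */
double gops_per_watt(double gops, double watts)
{
    assert(watts > 0.0);
    return gops / watts;
}

/* Active pixels per second for a video format (blanking not counted). */
double pixel_rate(int width, int height, double fps)
{
    assert(width > 0 && height > 0 && fps > 0.0);
    return (double)width * (double)height * fps;
}
```

With ~100 Gops at 0.5 W to 1 W of fabric power, gops_per_watt() gives the 100-200 Gops/W range. pixel_rate(1920, 1080, 60) is about 124 Mpix/s, just under the 130 MHz fabric clock; reading that as roughly one pixel per clock is an inference, not something the slides state.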

Energy Efficiency (MOPS/mW or OP/nJ)
[Figure: energy efficiency of microprocessors, general-purpose DSPs, and dedicated designs; the spread is 3 orders of magnitude.]
Courtesy Bob Brodersen, based on published results at ISSCC conferences.

Energy Efficiency (MOPS/mW or GOPS/W)
[Figure: the same chart of microprocessors, general-purpose DSPs, and dedicated designs, with this application marked on it.]

Conclusion
Every processor benefits from a combination with an FPGA: Zynq device
Every FPGA benefits from a combination with a processor: Zynq device
We have shown an application programmed in OpenCV, leveraging Vivado HLS, running on a 1-2 W Zynq device at 1080p 60 fps in real time.
Special thanks to the team that worked on the demo: Jack Lo, Fernando Martinez Vallina, S. Mohan, Vinod Kathail, and to many colleagues in the Vivado High Level Synthesis team and the Video Platform teams.

You can do this too:
OpenCV and HLS video:
http://www.xilinx.com/csi/training/vivado/leveraging-opencv-and-high-level-synthesis-with-vivado.htm
OpenCV and HLS application note:
http://www.xilinx.com/support/documentation/application_notes/xapp1167.pdf
Xilinx Zynq 702 board:
http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC702-G.htm
http://www.zedboard.org/