Generalised Parallel Bilinear Interpolation Architecture
Suhaib A. Fahmy
Centre for Telecommunications Value-chain Research
Trinity College Dublin
Ireland
[email protected]
Abstract

Bilinear interpolation is widely used in computer vision for extracting pixel values for positions that lie off the pixel grid in an image. For each sub-pixel, the values of four neighbours are used to compute the interpolated value. This presents a challenge since four pixels must be read from the source image memory for each output pixel. This paper presents an architecture, for implementation within FPGA-based vision systems, that takes advantage of the heterogeneous resources available on modern devices to parallelise these memory accesses through efficient distribution of the source image in embedded memories. We show how intrinsic information in the sub-pixel addresses can be used to implement bilinear interpolation efficiently. We then suggest modifications to the architecture for larger image sizes which exceed the memory capabilities of modern FPGAs. The architecture is shown to achieve a performance of 250 Msamples per second in a modern device.

Figure 1. The bilinear interpolation grid. Given a point (x, y), we interpolate between floor(x) and ceil(x) on each of two rows, floor(y) and ceil(y), using the fractional part of x to obtain the two intermediate points shown. Then we use the fractional part of y to interpolate between these two points to obtain the final interpolated value.

1. Introduction
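As background, the two-stage procedure of Figure 1 can be written as a small software reference model. This is a sketch in Python with illustrative names (`bilerp`, `img`), not the hardware implementation described in this paper:

```python
import math

def bilerp(img, x, y):
    """Bilinear interpolation at sub-pixel position (x, y), as in Figure 1:
    interpolate along x on rows floor(y) and ceil(y), then along y."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = math.ceil(x), math.ceil(y)
    fx, fy = x - x0, y - y0  # fractional parts of the coordinates
    # Horizontal interpolation on the two bracketing rows.
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    # Vertical interpolation between the two intermediate points.
    return (1 - fy) * top + fy * bot
```

Each output value needs reads of four neighbouring pixels; this access pattern is the cost that motivates the architecture below.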
Bilinear interpolation suffers in speed terms because it requires access to four source pixels in order to compute a sub-pixel value. This typically means four memory accesses, and thus a reduction in the speed of pixel extraction by a factor of four over nearest-neighbour approximation, which slows down the surrounding vision system [2].

Other interpolation methods exist, such as bicubic and sinc interpolation. While they offer even better quality in terms of the transformed image, the computational requirements are significantly increased. Hence, they are typically used for the final display of images, as opposed to the internals of a vision system, for which speed is of paramount importance. Bilinear interpolation is still widely used within vision systems. Note that while a system typically accepts input frames and produces output frames at standard rates, the processing rates within the system can be much higher. As an example, the Trace transform system in [3] processes 30 frames per second, but each of these frames accesses data in 180 transformed frames. Hence it is unfeasible to use external hardware, or even to access external memory, within the processing pipeline.

FPGAs have found significant use in the field of image and video processing. A number of advanced techniques which otherwise map poorly to CPUs and GPUs can be accelerated by designing custom hardware architectures that exploit algorithmic parallelism. FPGAs also provide an ideal platform for testing and prototyping, given the ease with which a design can be modified. Many computer vision algorithms require pixel approximations, as described above, somewhere within the datapath. Indeed, some of our previous work on the Trace transform [3] used nearest-neighbour approximation, and the design proposed in this paper could be integrated into that work.

Modern FPGAs provide a range of heterogeneous resources for use by the designer. Hard multipliers and DSP blocks are useful in a variety of signal processing applications, as are embedded memory blocks. These hard blocks offer guaranteed performance, while also freeing up logic and routing resources for other parts of a design.

A typical application of bilinear interpolation within a vision system could become the bottleneck in speed terms due to the increased memory access requirements over nearest-neighbour approximation. Even if a vision processing system is able to run at high speeds within the FPGA fabric through thorough pipelining, a standard sequential approach to bilinear interpolation would reduce the speed to a quarter of its potential. The aim of this design is to compute one sub-pixel value in each clock cycle within a stream of images, and thus run as fast as the rest of the system.

In this paper we propose a generalised architecture for bilinear interpolation that can be integrated into any computer vision application implemented on an FPGA. This architecture parallelises pixel accesses through distribution of the source image across embedded memories. It accepts a stream of sub-pixel addresses for which it computes the interpolated values in a single-cycle-delay pipeline. This architecture can find use within a wide array of applications. We explore a method that uses embedded memories for images of small size, then propose how the architecture could be adapted for processing images of large sizes, such as Full HD (1920 × 1080 pixels).

2. Previous Work

A standard technique for bilinear interpolation fetches each of the four required pixels for a sub-pixel address serially [2]. This means that four memory accesses are required per output pixel value. This can slow down processing significantly and preclude real-time performance. This is typical of solutions in the software world and, considering the fast memory accesses and efficient caching schemes, is sufficient in such a context. In software, the interpolation calculation itself is also serial.

In cases where the interpolation factor is fixed, it is possible to apply a simple 2-dimensional filter to the image. This
might occur when an image is being expanded by a fixed factor; for expansion by two, since each new pixel value would be half the sum of the two adjacent pixels, it is a relatively straightforward implementation.
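For a fixed expansion factor of two, for instance, the inserted pixels are simple midpoints, needing only an add and a shift per pixel. A hypothetical one-dimensional sketch in Python (`expand_row_2x` is an illustrative name, not from the paper):

```python
def expand_row_2x(row):
    """Expand a row of pixels by a factor of two: original pixels are
    kept, and each inserted pixel is half the sum of its two
    neighbours, so no general-purpose multiplier is needed."""
    out = []
    for a, b in zip(row, row[1:]):
        out.append(a)
        out.append((a + b) // 2)  # midpoint: one add and one shift
    out.append(row[-1])  # final pixel has no right-hand neighbour
    return out
```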
Hardware architectures for bilinear interpolation have typically not veered too far from the serial accesses mentioned, though pipelining the interpolation itself has been

[Figure: distribution of source pixels across embedded memories, with each pixel labelled by the (even/odd, even/odd) parity of its row and column coordinates and assigned to a corresponding memory (memory 0, memory 1, memory 2, ...).]
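The distribution of pixels across memories by coordinate parity, as labelled in the figure, is what makes parallel access possible: the four neighbours of any sub-pixel position always differ in the parity of their row and/or column, so they land in four different memories and can all be read in the same cycle. A Python sketch of this mapping (function names are illustrative; the exact address fields used by the paper follow its Table 1):

```python
import math

def memory_of(y, x):
    """Assign pixel (y, x) to one of four memories by coordinate parity:
    (even, even) -> 0, (even, odd) -> 1, (odd, even) -> 2, (odd, odd) -> 3."""
    return (y % 2) * 2 + (x % 2)

def neighbour_memories(y, x):
    """Memories holding the four neighbours of sub-pixel position (y, x).
    Using floor and floor + 1 keeps the four addresses distinct even at
    integral coordinates, and the parities guarantee four distinct banks."""
    ys = (math.floor(y), math.floor(y) + 1)
    xs = (math.floor(x), math.floor(x) + 1)
    return {memory_of(yy, xx) for yy in ys for xx in xs}
```

Whatever sub-pixel address is requested, `neighbour_memories` always returns all four banks, which is why one interpolated value can be produced per cycle.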
necessary to implement a double buffer. We can assume that the input to the system is a continuous stream of pixels in raster-scan format, for one image followed immediately by the next. In this case, it is necessary that we distribute the pixels for the first image amongst the memories on the first side of the double buffer, while the pixels from the second side of the double buffer are used by the interpolation engine. Once a whole image has been written to the first side, the roles are swapped, and the next image is written into the second side, while the first side is used for processing. Note that the sub-pixel addresses fed into the system will extract values for the previously stored frame.

[Figure: the double-buffered memory arrangement; the incoming pixel stream and addresses (ycoord[8:0], xcoord[8:0]) are steered between mem1 and mem2 by a dbuffsel signal.]
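The double-buffering scheme just described can be modelled behaviourally; this is a Python sketch (class and method names are illustrative), with dictionaries standing in for the two memory sides:

```python
class DoubleBuffer:
    """Two image stores: one side is filled by the incoming raster-scan
    stream while the other serves the interpolation engine; the roles
    swap at each frame boundary."""

    def __init__(self):
        self.sides = [{}, {}]
        self.write_side = 0  # side currently receiving the input stream

    def write_pixel(self, y, x, value):
        self.sides[self.write_side][(y, x)] = value

    def read_pixel(self, y, x):
        # Reads are always served from the previously stored frame.
        return self.sides[1 - self.write_side][(y, x)]

    def end_of_frame(self):
        # Swap roles: the frame just written becomes the readable one.
        self.write_side = 1 - self.write_side
```

This mirrors the behaviour above: sub-pixel addresses extract values for the previously stored frame, while the current frame streams into the other side.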
For the purposes of clarity, the remainder of this paper
[Figure 5 datapath: pixels pixela–pixeld are read from mem1 and mem2 at addresses addra–addrd, formed as ycoord[15:8] & xcoord[15:7] and ycoord[15:8] & (xcoord[15:7] + 1); the fractional parts ycoord[6:0] and xcoord[6:0] drive three subtract/multiply/add stages, ycoord[7] drives the swap unit, and the output is int_pixel.]

Figure 5. The circuit for calculating the interpolated sub-pixel. Bit positions within the sub-pixel address are used as per Table 1. Two intermediate horizontally interpolated points are computed in parallel. The swap unit decides which is the uppermost pixel in the interpolation based on the LSB of the y-coordinate. The circuit is fully pipelined.
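The arithmetic of this datapath can be modelled in software. The bit widths here are assumptions taken from the figure's labels (7 fractional bits per coordinate, per the ycoord[6:0] and xcoord[6:0] signals); Table 1 itself is not reproduced in this copy, and the swap unit is omitted for brevity:

```python
FRAC_BITS = 7           # assumed from the ycoord[6:0]/xcoord[6:0] labels
SCALE = 1 << FRAC_BITS  # represents 1.0 in this fixed-point format

def interp_pixel(pixa, pixb, pixc, pixd, xfrac, yfrac):
    """Fixed-point model of the Figure 5 datapath. pixa/pixb are the
    upper-row pair and pixc/pixd the lower-row pair (after the swap
    unit, which is omitted here). Each stage is one subtract, one
    multiply and one add, matching the figure's three multiply-add units."""
    # Two horizontal interpolations, computed in parallel in hardware.
    top = (pixa << FRAC_BITS) + (pixb - pixa) * xfrac
    bot = (pixc << FRAC_BITS) + (pixd - pixc) * xfrac
    # Vertical interpolation, then discard the accumulated fraction bits.
    return ((top << FRAC_BITS) + (bot - top) * yfrac) >> (2 * FRAC_BITS)
```

Written in the form pixa + (pixb − pixa)·frac, each stage needs a single multiplier, which is consistent with the three DSP48 blocks reported in Table 2.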
A 512×512 pixel image would require 2.1 Mbits of memory to store. A Full HD image at 1920×1080 pixels would require a full 16.6 Mbits. Considering the need for double buffering due to the real-time requirements of the system, these numbers must be doubled. Today's most advanced, largest-capacity devices from both Xilinx (Virtex 5) [10] and Altera (Stratix IV) offer at most 20 Mbits of memory on chip. The memories available on devices are typically spread throughout the chip, and thus suit integration within a larger vision system, but we must consider the limitations capacity places on our proposed architecture.

For situations where the on-die memory is insufficient, it is possible to adapt this architecture to use external memories. To simplify the implementation, ZBT SRAMs would be required, allowing pipelined single-cycle memory access. In such a system, the storage part of the architecture would store to external rather than internal memory. A double buffer would also be required. One difficulty is the number of banks of memory required. Since external SRAM memory is usually single-ported, four memories would be required to allow for the parallel accesses, and a further four for the double buffer. One way this can be overcome, bearing in mind the fact that SRAM memories are typically wide (>32 bits), is to concatenate neighbouring pixels in a single memory location. As an example, when storing the pixel for location (y1, x1), we store both (y1, x1) and (y1, x1 + 1) concatenated. That way, when that location is read, the two neighbouring pixels are available immediately. This reduces the memory requirements to four external SRAM banks, which is more reasonable.

Concatenating unrelated data which is accessed using an identical pattern is an efficient way of parallelising. In [3] we showed how concatenating four orthogonal rotations within one memory location could provide four rotations of an image in the time normally taken for a single rotation.

The performance of such a system, based on external memory, is clearly dependent on the speed of the memory banks. The Celoxica RC300 board has 6 SRAM banks, accessible at speeds of around 80 MHz. This would translate to 80 Mpixels/second. Faster boards exist today, but, not having access to such hardware, we cannot confirm the precise performance on such boards. The I/O connectivity requirements are in line with many development boards in production.

Some other solutions, when dealing with specialised problems, can also be investigated. If the sub-pixel addresses are requested in some predictable order, and are sequentially local, then it would be possible to process parts of the image in separate runs. For example, if applying image zooming, subsequent pixels will still be in the same rows, so we could process blocks of rows together. The architecture we present here makes no assumptions about the sub-pixel addresses being requested, and thus maintains full flexibility for the vision system making use of it.

4. Results

The architecture was implemented in VHDL, making use of Xilinx CoreGen memory cores in order to provide maximum performance. The circuit was synthesised, mapped,
placed and routed using Xilinx ISE 10.1. The target device was a Xilinx Virtex 4 XC4VSX55 FPGA. The SX family includes a higher proportion of embedded memories, as required for this application. As has been mentioned above, the logic requirements for the architecture are extremely minimal, while the memory requirements are substantial for an FPGA application. Considering the proposed context for this architecture, this platform is ideally suited, since a significant portion of the logic fabric, a large number of embedded DSP blocks, and a reasonable number of memory blocks are available for the implementation of the vision portion of a complete system.

In order to extract maximum performance from the design, it was highly pipelined, including at the inputs and outputs to the embedded memories. Since the embedded Block Memories are small (holding 18 Kbits each), a single system-level memory is in fact distributed throughout the device fabric. Hence routing can become a problem if the memory outputs are not registered. We found that the unregistered version of the design had a maximum clock speed of 133 MHz. After registering, the clock speed increased to 250 MHz, or 250 Msamples/sec.

The resource requirements are summarised in Table 2.

Resource Type              Amount (%)
Slices                     362 (1%)
18Kbit Block Memories      256 (80%)
DSP48 Blocks               3 (1%)
Minimum clock period       3.994 ns

Table 2. Synthesis results for a Xilinx Virtex 4 XC4VSX55.

At 250 MHz, this architecture can process 512 × 512 pixel images at speeds equivalent to 950 frames per second. The memory requirements of this architecture vary linearly with image size. The largest Virtex 5 device, the XC5VSX240T, contains sufficient memory for processing 1024 × 1024 pixel images. Clearly, for larger images that would require external memory, the resultant speed would depend on the maximum access speeds of the external memory. For comparison against software, we consider a tool called O-Matrix [1], which is used in industrial applications. It exhibits a huge performance advantage over MATLAB for image processing functions. For image rotation (chosen as a very simple operation which uses interpolation) with bilinear interpolation, they claim a performance of 15 ms (as opposed to MATLAB's 2000 ms) for a 1280 × 960 image (equivalent to 8.2 Mpixels/second). Our system is over 30 times faster. We must also factor in that for more complicated systems the speedup over software will be more significant, since the computer vision part will be significantly faster.

The system was tested by preparing serialised input files of image data and sub-pixel coordinates in MATLAB and using these as the input to a VHDL testbench.

5. Conclusion

We have presented a generalised architecture for bilinear interpolation for use within vision applications implemented on FPGAs. The architecture distributes source image pixels across multiple memories which are then accessed in parallel. By using intrinsic data in the sub-pixel addresses, interpolations can be computed using minimal hardware resources, freeing up most of the device's logic fabric and embedded DSP blocks for implementing the remainder of the vision system. The architecture achieved a clock speed of 250 MHz on a Xilinx Virtex 4. We also suggested how the architecture could be modified for use with larger images that exceed the available memory on the FPGA.

References

[1] O-Matrix webpage. http://www.omatrix.com/benchipt.pdf
[2] J. Baltes and J. Anderson. Interpolation methods for global vision systems. In RoboCup 2004: Robot Soccer World Cup VIII, 2004.
[3] S. Fahmy, C.-S. Bouganis, P. Cheung, and W. Luk. Real-time hardware acceleration of the trace transform. Journal of Real-Time Image Processing, 2:235–248, 2007.
[4] R. Gonzalez and R. Woods. Digital Image Processing. Pearson Prentice Hall, 2007.
[5] K. Gribbon and D. Bailey. A novel approach to real-time bilinear interpolation. In IEEE International Workshop on Electronic Design, Test and Applications, 2004.
[6] K. Gribbon, C. Johnston, and D. Bailey. A real-time FPGA implementation of a barrel distortion correction algorithm with bilinear interpolation. In Image and Vision Computing New Zealand, pages 408–413, 2003.
[7] Z. Haihua, X. Zhiqi, and C. Guanghua. VLSI implementation of sub-pixel interpolator for H.264/AVC encoder. In International Symposium on High Density Packaging and Microsystem Integration, 2007.
[8] F. Kelly and A. Kokaram. Fast image interpolation for motion estimation using graphics hardware. In Proceedings of the SPIE, pages 184–194, 2004.
[9] P. Uthaichana and E. Leelarasmee. A pipelined bilinear interpolation for real time video image expansion. In IEEE Region 10 Conference, 2004.
[10] Xilinx Inc. Virtex-5 User Guide, 2007.
[11] D. Zhou and P. Liu. A hardware-efficient dual-standard VLSI architecture for MC interpolation in AVS and H.264. In IEEE International Symposium on Circuits and Systems (ISCAS), 2007.