

2008 International Conference on Reconfigurable Computing and FPGAs

Generalised Parallel Bilinear Interpolation Architecture for Vision Systems

Suhaib A. Fahmy
Centre for Telecommunications Value-chain Research
Trinity College Dublin
Ireland
[email protected]

Abstract

Bilinear interpolation is widely used in computer vision for extracting pixel values for positions that lie off the pixel grid in an image. For each sub-pixel, the values of four neighbours are used to compute the interpolated value. This presents a challenge since four pixels must be read from the source image memory for each output pixel. This paper presents an architecture, for implementation within FPGA-based vision systems, that takes advantage of the heterogeneous resources available on modern devices to parallelise these memory accesses through efficient distribution of the source image in embedded memories. We show how intrinsic information in the sub-pixel addresses can be used to implement bilinear interpolation efficiently. We then suggest modifications to the architecture for larger image sizes which exceed the memory capabilities of modern FPGAs. The architecture is shown to achieve performance of 250 Msamples per second in a modern device.

Figure 1. The bilinear interpolation grid. Given a point (x, y), we interpolate between floor(x) and ceil(x) on each of two rows floor(y) and ceil(y) using the fractional part of x to obtain the two intermediate points shown. Then we use the fractional part of y to interpolate between these two points to obtain the final interpolated value.

1. Introduction

When applying computer vision techniques to an image, it is common to require some method by which to extract points that lie off the pixel grid. An algorithm used to transform an image, whether by rotation, shifting, zooming or the like, or access the image in an order or along contours that are not aligned with the source coordinates, will produce sub-pixel sampling points for which values must be calculated by interpolating from the pixels in the source image.

In the most basic case, the fractional part of any sub-pixel address is truncated or rounded, so each pixel simply takes the value of the nearest "real" pixel in the source image. This is called "nearest neighbour" approximation. This method is the simplest, and involves no calculation. It also only requires one pixel from the source image for each sub-pixel which is being calculated; hence it can operate at the full speed of the surrounding circuit. However, the resultant approximation is not optimal with this technique. If used for image transformation, the deterioration in quality is clearly visible, while feature extraction can also be affected.

In bilinear interpolation, the value of a sub-pixel is interpolated from its four nearest neighbours linearly. The horizontal fractional component (of the sub-pixel coordinate) is used to compute two interpolated points that still lie on the horizontal grid, then the vertical fractional component is used to interpolate between these two points. Figure 1 shows how this is done for an image. In each case, the linear interpolation between two values a and b by fraction k is computed as (b − a) × k + a. Bilinear interpolation offers significantly enhanced image quality over nearest-neighbour approximation [4]. Figure 2 shows a comparison between the two methods when rotating an image.
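The two-step interpolation described above can be sketched in software (a minimal Python model for reference, not the hardware implementation; the function name and array convention are illustrative):

```python
import math

def bilinear(img, x, y):
    """Interpolate img at sub-pixel (x, y); img[row][col] is a 2-D list."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, len(img[0]) - 1), min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    # Two horizontal interpolations, each of the form (b - a) * k + a
    top = (img[y0][x1] - img[y0][x0]) * fx + img[y0][x0]
    bot = (img[y1][x1] - img[y1][x0]) * fx + img[y1][x0]
    # One vertical interpolation between the two intermediate values
    return (bot - top) * fy + top
```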

978-0-7695-3474-9/08 $25.00 © 2008 IEEE 331


DOI 10.1109/ReConFig.2008.15
Figure 2. A close-up comparison of nearest neighbour approximation (b) and bilinear interpolation
(c) when rotating a test image (a).

Bilinear interpolation suffers in speed terms because it requires access to four source pixels in order to compute a sub-pixel value. This typically means four memory accesses and thus a reduction in the speed of pixel extraction by a factor of four over nearest neighbour approximation, which results in slowing down the surrounding vision system [2].

Other interpolation methods exist, such as bicubic and sinc interpolation. While they offer even better quality in terms of the transformed image, the computational requirements are significantly increased. Hence, they are typically used for the final display of images, as opposed to the internals of a vision system, for which speed is of paramount importance. Bilinear interpolation is still widely used within vision systems. Note that while a system typically accepts input frames and produces output frames at standard rates, the processing rates within the system can be much higher. As an example, the Trace transform system in [3] processes 30 frames per second, but each of these frames accesses data in 180 transformed frames. Hence it is unfeasible to use external hardware, or even to access external memory, within the processing pipeline.

FPGAs have found significant use in the field of image and video processing. A number of advanced techniques which otherwise map poorly to CPUs and GPUs can be accelerated by designing custom hardware architectures that exploit algorithmic parallelism. FPGAs also provide an ideal platform for testing and prototyping, given the ease with which a design can be modified. Many computer vision algorithms require pixel approximations, as described above, somewhere within the datapath. Indeed, some of our previous work on the Trace transform [3] used nearest-neighbour approximation, and the design proposed in this paper could be integrated into that work.

Modern FPGAs provide a range of heterogeneous resources for use by the designer. Hard multipliers and DSP blocks are useful in a variety of signal processing applications, as are embedded memory blocks. These hard blocks offer guaranteed performance, while also freeing up logic and routing resources for other parts of a design.

A typical application of bilinear interpolation within a vision system could become the bottleneck in speed terms due to the increased memory access requirements over nearest neighbour approximation. Even if a vision processing system is able to run at high speeds within the FPGA fabric through thorough pipelining, a standard sequential approach to bilinear interpolation would reduce the speed to a quarter of its potential. The aim of this design is to compute one sub-pixel value in each clock cycle within a stream of images, and thus run as fast as the rest of the system.

In this paper we propose a generalised architecture for bilinear interpolation that can be integrated into any computer vision application implemented on FPGA. This architecture parallelises pixel accesses through distribution of the source image across embedded memories. It accepts a stream of sub-pixel addresses for which it computes the interpolated values in a single-cycle delay pipeline. This architecture can find use within a wide array of applications. We explore a method that can use embedded memories for images of small size, then propose how the architecture could be tweaked for processing images of large sizes, like Full HD (1920 × 1080 pixels).

2. Previous Work

A standard technique for bilinear interpolation fetches each of the four required pixels for a sub-pixel address serially [2]. This means that four memory accesses are required per output pixel value. This can slow down processing significantly and preclude real-time performance. This is typical of solutions in the software world and, considering the fast memory accesses and efficient caching schemes, is sufficient in such a context. In software, the interpolation calculation itself is also serial.

In cases where the interpolation factor is fixed, it is possible to apply a simple 2-dimensional filter to the image.
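Where the interpolation factor is fixed, bilinear interpolation collapses to a fixed filter; for a 2× zoom, each inserted pixel is simply half the sum of its two neighbours. A minimal Python sketch of the horizontal case (the function name is illustrative):

```python
def zoom2x_row(row):
    """2x horizontal zoom: between each pair of adjacent pixels,
    insert their average (the interpolation factor is fixed at 0.5)."""
    out = []
    for a, b in zip(row, row[1:]):
        out.append(a)
        out.append((a + b) / 2)  # fixed-factor bilinear: (b - a) * 0.5 + a
    out.append(row[-1])
    return out
```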

This might occur when an image is being expanded by a fixed factor, as in [9]. The author applies a 4× zoom (2× in each dimension), by computing one interpolated pixel between each two adjacent pixels. Since the interpolated pixel will be half the sum of the two adjacent pixels, it is a relatively straightforward implementation.

Hardware architectures for bilinear interpolation have typically not veered too far from the serial accesses mentioned, though pipelining the interpolation itself has been done before. In [6, 5], the authors develop an intelligent row caching scheme which maintains a cache of previously accessed pixels, noting that they are reused in computing subsequent pixels. This allows a single access to memory to provide the required missing pixel. The proposed scheme is based on the assumption that the sub-pixel addresses are approximately spaced similarly to the source image pixel grid, or closer, and follow raster-scan order. Hence subsequent sub-pixels will have some neighbours in common, which can be cached. This holds true for most classes of transformation, such as the barrel distortion correction to which they apply it. In some cases, where all four pixels cannot be obtained with a single memory access, a three-pixel interpolation is computed instead.

In order to allow for full parallel access to source pixels, a number of memory-based techniques can be employed. Consider the scenario where the source image is duplicated in four separate memories, each accessible in parallel. This would solve the problem of access, but at the cost of increased memory requirements, due to the high level of redundancy. It is also possible to take advantage of multi-port memories, but these are not widespread as distinct devices on development boards. Transferring data off-chip for processing can also result in a performance penalty.

It is worth noting that the current state of the art for image rendering eschews bilinear, and even bicubic, interpolation in favour of filter-based methods. The H.264 and AVS video standards both make use of FIR filter based interpolation methods for rendering. These algorithms are being tackled by hardware designers [7, 11]. Bilinear interpolation is still used for sub-pixel extraction in computer vision tasks, though [8], and hence an architecture that can be integrated with computer vision hardware, especially on streaming images, without adversely impacting performance is of great benefit.

3. Proposed Architecture

The proposed architecture is designed to address the following problem: given a piece of vision hardware that requires the computation of sub-pixel values for which it provides addresses, and a stream of input images, compute the sub-pixel values in real-time, in the order in which the addresses are provided, assuming that new images stream directly following each other, and the sub-pixel addresses provided for each image fall within the time it takes for the image to stream through. Hence for N × N pixel images, N² sub-pixel values can be computed at streaming real-time speed. Note that no assumptions are made about the order or pattern of sub-pixel addresses, which allows for complete freedom in defining pixel access schemes. For example, one could trace lines or contours in the image.

Figure 3. Mapping pixels in the source image to memories. The four combinations of pixel coordinates are shown in (a). A mapping for dual-port memories is shown in (b), while (c) shows a mapping for single-port memories, requiring 4 in total for full parallel access.

The proposed architecture works primarily by rearranging the streaming image pixels that enter the system into separate memories according to their coordinates, allowing for parallel access to the four required pixels for the interpolation. It is clear that the interpolation uses adjacent pixels in two subsequent rows of an image. Hence for parallel access, we require some way of accessing pixels in subsequent rows independently, as well as parallel access to adjacent pixels in each row. One method would be to store alternate rows in different memories; the even rows of pixels would be stored in one memory while the odd rows are stored in another. Then we would also need to store even and odd columns in different memories. Thus, we would have four memories which would contain pixels that fall into the following four groups of (x, y) coordinates: (even,even), (even,odd), (odd,even) and (odd,odd), shown in Figure 3. Now for any interpolation point, we could access the four memories in parallel to obtain the required pixels.

The embedded memories in modern FPGAs allow for a further optimisation; since these memories can be used in a dual-port configuration, it is possible to use only two memories. Storing alternate rows separately would still allow subsequent pixels from the same row to be accessed in parallel. This simplifies the architecture somewhat.

In order to allow the system to run uninterrupted, it is necessary to implement a double buffer. We can assume that the input to the system is a continuous stream of pixels in raster scan format, for one image followed immediately by the next. In this case, it is necessary that we distribute the pixels for the first image amongst the memories on the first side of the double buffer, while the pixels from the second side of the double buffer are used by the interpolation engine. Once a whole image has been written to the first side, the roles are swapped, and the next image is written into the second side, while the first side is used for processing. Note that the sub-pixel addresses fed into the system will extract values for the previously stored frame.

Figure 4. One side of the double-buffer used for distributing incoming pixels between the two intermediate memories. The LSB of the y-coordinate selects which of the two memories a pixel is written to. (The two sides of the buffer are identical in the real circuit.)

For the purposes of clarity, the remainder of this paper will deal with an implementation applied to a 512 × 512 pixel, 8 bits per pixel, greyscale image. This assumption will be explored further in Section 3.2. We will also focus primarily on implementations using dual-port memories.

3.1. Addressing Scheme

In order to correctly select which buffer memory to write a pixel to, and which memories to read the source pixels from for interpolation, the pixel coordinates need to be manipulated. First, on the storage side, a row and column counter are kept up to date with each pixel that enters the system. The "new frame" pulse resets these so they maintain the correct values for each pixel. Since subsequent rows must be stored in different memories, we can use the least significant bit (LSB) of the row value to select between them. In the case of dual-port memories, this is sufficient. (If only single-port memories are available, then we must also select based on the LSB of the column value.) The rest of the row and column values (minus the LSB) gives the address at which to store the pixel in the relevant memory. This is shown in Figure 4.

Table 1. Interpreting the sub-pixel address for interpolation of 512 × 512 pixel images.

Address Bits    Interpretation
ycoord[15:8]    Row part of mem address
ycoord[7]       Which row is first in interpolation
ycoord[6:0]     Vertical interpolation factor
xcoord[15:7]    Column part of mem address
xcoord[6:0]     Horizontal interpolation factor

On the interpolation side, we must now consider how we interpret a sub-pixel address. Recall that we will always interpolate between subsequent rows and columns. However, it is possible that the upper row is odd or even, and similarly, the left column could be odd or even.

In the case of dual-port memories, adjacent pixels within the row can be accessed simply by passing two subsequent addresses to the two ports (while correcting the edge-cases to ensure the coordinates are valid). In the case of single-port memories we must also consider which memory is the leftmost. Firstly, we extract the integer parts of the sub-pixel address. The LSBs of these integer parts (x and y coordinates) determine whether the odd or even row or column are uppermost or leftmost. If the LSB is a zero, then we are interpolating between an even row and the following odd row. We pass the integer part of the address, minus the LSB, to both (or, in the case of single-port memories, all four) memories, then maintain the order for addresses with zero LSB, and swap them for addresses with a one LSB.

The fractional parts of each sub-pixel address are used for the linear interpolation further downstream. Since we now have access to all four pixels required in a single clock cycle, we can use a pipeline to interpolate, providing a full result in each clock cycle. The first part of the pipeline interpolates between the two adjacent pixels in each of the two rows (maintaining the ordering scheme mentioned above), by using the fractional part of the x coordinate, giving two row interpolation values. The next part of the pipeline interpolates between these two intermediate values, using the fractional part of the y coordinate to give the final interpolated value. Table 1 summarises the addressing scheme and Figure 5 shows the pipeline used for interpolation.
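The field extraction of Table 1 can be modelled directly. A Python sketch, assuming 16-bit fixed-point coordinates with 9 integer and 7 fractional bits (consistent with the 512 × 512 case; the interpretation of the fraction as k = frac/128 and the function names are our illustration):

```python
def decode_y(ycoord):
    """Split a 16-bit y-coordinate (9 integer bits, 7 fractional bits)."""
    row_addr = (ycoord >> 8) & 0xFF   # ycoord[15:8]: row part of mem address
    row_sel  = (ycoord >> 7) & 0x1    # ycoord[7]: which row is first (memory order)
    fy       = ycoord & 0x7F          # ycoord[6:0]: vertical factor, k = fy / 128
    return row_addr, row_sel, fy

def decode_x(xcoord):
    """Split a 16-bit x-coordinate; columns are not split across memories."""
    col_addr = (xcoord >> 7) & 0x1FF  # xcoord[15:7]: column part of mem address
    fx       = xcoord & 0x7F          # xcoord[6:0]: horizontal factor
    return col_addr, fx
```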


Figure 5. The circuit for calculating the interpolated sub-pixel. Bit positions within the sub-pixel
address are used as per Table 1. Two intermediate horizontally interpolated points are computed in
parallel. The swap unit decides which is the uppermost pixel in the interpolation based on the LSB
of the y-coordinate. The circuit is fully pipelined.
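The datapath of Figure 5 can be described behaviourally as follows (a Python model of the pipeline arithmetic only, ignoring the pipeline registers; the names mirror the figure, and for clarity the 7-bit fractions are taken as real numbers in [0, 1)):

```python
def interp_pipeline(pixel_a, pixel_b, pixel_c, pixel_d, fx, fy, y_lsb):
    """pixel_a/b are read from mem1 and pixel_c/d from mem2, as in Figure 5.
    y_lsb models the swap unit: it decides which memory holds the upper row."""
    # Stage 1: two horizontal interpolations, each (b - a) * fx + a
    row1 = (pixel_b - pixel_a) * fx + pixel_a
    row2 = (pixel_d - pixel_c) * fx + pixel_c
    # Swap: if the integer y is odd, mem2 holds the upper row
    top, bottom = (row2, row1) if y_lsb else (row1, row2)
    # Stage 2: vertical interpolation between the intermediate values
    return (bottom - top) * fy + top
```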

3.2. Alternative Considerations

While modern FPGAs contain significantly more memory on die than in the past, the amount available is still limited when considering the caching of images. Even when we consider a greyscale image of 8 bits per pixel (bpp), a 512 × 512 pixel image would require 2.1 Mbits of memory to store. A Full HD image at 1920 × 1080 pixels would require a full 16.6 Mbits. Considering the need for double-buffering due to the real-time requirements of the system, these numbers must be doubled. Today's most advanced, largest capacity devices from both Xilinx (Virtex 5) [10] and Altera (Stratix IV) offer at most 20 Mbits of memory on chip. The memories available on devices are typically spread throughout the chip and thus suit integration within a larger vision system, but we must consider the limitations capacity places on our proposed architecture.

For situations where the on-die memory is insufficient, it is possible to adapt this architecture to use external memories. To simplify the implementation, ZBT SRAMs would be required, allowing pipelined single-cycle memory access. In such a system, the storage part of the architecture would store to external rather than internal memory. A double buffer would also be required. One difficulty is the number of memory banks required. Since external SRAM memory is usually single-ported, four memories would be required to allow for the parallel accesses, and a further four for the double-buffer. One way this can be overcome, bearing in mind the fact that SRAM memories are typically wide (>32 bits), is to concatenate neighbouring pixels in a single memory location. As an example, when storing the pixel for location (y1, x1), we store both (y1, x1) and (y1, x1 + 1) concatenated. That way, when that location is read, the two neighbouring pixels are available immediately. This reduces the memory requirements to 4 external SRAM banks, which is more reasonable.

Concatenating unrelated data which is accessed using an identical pattern is an efficient way of parallelising. In [3] we showed how concatenating four orthogonal rotations within one memory location could provide four rotations of an image in the time normally taken for a single rotation.

The performance of such a system, based on external memory, is clearly dependent on the speed of the memory banks. The Celoxica RC300 board has 6 SRAM banks, accessible at speeds of around 80 MHz. This would translate to 80 Mpixels/second. Faster boards exist today, but not having access to such hardware, we cannot confirm the precise performance on such boards. The I/O connectivity requirements are in line with many development boards in production.

Some other solutions, when dealing with specialised problems, can also be investigated. If the sub-pixel addresses are requested in some predictable order, and are sequentially local, then it would be possible to process parts of the image in separate runs. For example, if applying image zooming, subsequent pixels will still be in the same rows, so we could process blocks of rows together. The architecture we present here makes no assumptions about the sub-pixel addresses being requested, and thus maintains full flexibility for the vision system making use of it.
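The on-chip storage figures quoted in this section follow directly from the image dimensions, and can be checked with a line of arithmetic (a simple helper; values in Mbits at 8 bpp, with the doubling for the double buffer):

```python
def image_mbits(width, height, bpp=8, double_buffered=False):
    """Storage requirement in Mbits for a width x height image."""
    bits = width * height * bpp
    if double_buffered:
        bits *= 2  # two full frames are held for uninterrupted streaming
    return bits / 1e6

print(image_mbits(512, 512))                          # ~2.1 Mbits
print(image_mbits(1920, 1080))                        # ~16.6 Mbits
print(image_mbits(1920, 1080, double_buffered=True))  # ~33.2 Mbits: exceeds 20 Mbits on chip
```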

4. Results

The architecture was implemented in VHDL, making use of Xilinx CoreGen memory cores in order to provide maximum performance. The circuit was synthesised, mapped, placed and routed using Xilinx ISE 10.1. The target device was a Xilinx Virtex 4 XC4VSX55 FPGA. The SX family includes a higher proportion of embedded memories, as required for this application. As has been mentioned above, the logic requirements for the architecture are extremely minimal, while the memory requirements are substantial for an FPGA application. Considering the proposed context for this architecture, this platform is ideally suited, since a significant portion of the logic fabric, a large number of embedded DSP blocks, and a reasonable number of memory blocks are available for the implementation of the vision portion of a complete system.

In order to extract maximum performance from the design, it was highly pipelined, including at the inputs and outputs to the embedded memories. Since the embedded Block Memories are small (holding 18 Kbits each), a single system-level memory is in fact distributed throughout the device fabric. Hence routing can become a problem if the memory outputs are not registered. We found that the unregistered version of the design had a maximum clock speed of 133 MHz. After registering, the clock speed increased to 250 MHz, or 250 Msamples/sec.

Table 2. Synthesis results for a Xilinx Virtex 4 XC4VSX55.

Resource Type            Amount (%)
Slices                   362 (1%)
18Kbit Block Memories    256 (80%)
DSP48 Blocks             3 (1%)
Minimum clock period     3.994 ns

The resource requirements are summarised in Table 2. At 250 MHz, this architecture can process 512 × 512 pixel images at speeds equivalent to 950 frames per second. The memory requirements of this architecture vary linearly with image size. The largest Virtex 5 device, the XC5VSX240T, contains sufficient memory for processing 1024 × 1024 pixel images. Clearly, for larger images that would require external memory, the resultant speed would depend on the maximum access speeds for the external memory. For comparison against software, we consider a tool called O-Matrix [1], which is used in industrial applications. It exhibits a huge performance advantage over MATLAB for image processing functions. For image rotation (chosen as it is a very simple operation which uses interpolation) with bilinear interpolation, they claim performance of 15 ms (as opposed to MATLAB's 2000 ms) for a 1280 × 960 image (equivalent to 8.2 Mpixels/second). Our system is over 30 times faster. We must also factor in that for more complicated systems the speedup over software will be more significant, since the computer vision part will be significantly faster.

The system was tested by preparing serialised input files of image data and sub-pixel coordinates in MATLAB and using these as the input to a VHDL testbench.

5. Conclusion

We have presented a generalised architecture for bilinear interpolation for use within vision applications implemented on FPGAs. The architecture distributes source image pixels across multiple memories which are then accessed in parallel. By using intrinsic data in the sub-pixel addresses, interpolations can be computed using minimal hardware resources, freeing up most of the device's logic fabric and embedded DSP blocks for implementing the remainder of the vision system. The architecture achieved a clock speed of 250 MHz on a Xilinx Virtex 4. We also suggested how the architecture could be modified for use with larger images that exceed the available memory on the FPGA.

References

[1] O-Matrix webpage (http://www.omatrix.com/benchipt.pdf).
[2] J. Baltes and J. Anderson. Interpolation methods for global vision systems. In RoboCup 2004: Robot Soccer World Cup VIII, 2004.
[3] S. Fahmy, C.-S. Bouganis, P. Cheung, and W. Luk. Real-time hardware acceleration of the trace transform. Journal of Real-Time Image Processing, volume 2, pages 235-248, 2007.
[4] R. Gonzalez and R. Woods. Digital Image Processing. Pearson Prentice Hall, 2007.
[5] K. Gribbon and D. Bailey. A novel approach to real-time bilinear interpolation. In IEEE International Workshop on Electronic Design, Test and Applications, 2004.
[6] K. Gribbon, C. Johnston, and D. Bailey. A real-time FPGA implementation of a barrel distortion correction algorithm with bilinear interpolation. In Image and Vision Computing New Zealand, pages 408-413, 2003.
[7] Z. Haihua, X. Zhiqi, and C. Guanghua. VLSI implementation of sub-pixel interpolator for H.264/AVC encoder. In International Symposium on High Density Packaging and Microsystem Integration, 2007.
[8] F. Kelly and A. Kokaram. Fast image interpolation for motion estimation using graphics hardware. In Proceedings of the SPIE, pages 184-194, 2004.
[9] P. Uthaichana and E. Leelarasmee. A pipelined bilinear interpolation for real time video image expansion. In IEEE Region 10 Conference, 2004.
[10] Xilinx Inc. Virtex-5 User Guide, 2007.
[11] D. Zhou and P. Liu. A hardware-efficient dual-standard VLSI architecture for MC interpolation in AVS and H.264. In IEEE International Symposium on Circuits and Systems (ISCAS), 2007.
