ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The massive adoption of IoT devices, recent gains in the efficiency of AI systems, and the increase in edge computational power have accelerated the deployment of edge AI systems. Implementing these systems on low-power embedded devices scattered across the edges of a network reduces latency and cost compared to traditional cloud-based AI computing systems. Motivated by the availability of low-complexity AI models and low-power embedded systems on the market, this paper provides a comparative study of the inference performance of convolutional neural networks on different edge devices, exploiting low-power GPUs and dedicated AI hardware. The benchmarks reached 864 inferences/s for the Jetson AGX Xavier board on a pre-trained SqueezeNet, with a power efficiency as high as 52.6 inferences/s per watt. The dedicated Movidius neural stick requires only 1.5 W to process 24.2 inferences/s.
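The figures quoted in this abstract can be related with simple arithmetic. The following is a minimal sketch, assuming efficiency is defined as throughput divided by average board power and that the two Xavier figures were taken at the same operating point (the abstract does not confirm either); the function names are illustrative and not from the paper.

```python
def implied_power_w(throughput_inf_s: float, efficiency_inf_s_per_w: float) -> float:
    """Power (W) implied by a throughput and an efficiency figure.

    Assumes efficiency = throughput / power, which is an assumption here.
    """
    return throughput_inf_s / efficiency_inf_s_per_w


def efficiency_inf_s_per_w(throughput_inf_s: float, power_w: float) -> float:
    """Inferences per second per watt for a given throughput and power."""
    return throughput_inf_s / power_w


# Figures taken from the abstract above.
xavier_power_w = implied_power_w(864.0, 52.6)
movidius_eff = efficiency_inf_s_per_w(24.2, 1.5)

print(f"Jetson AGX Xavier implied power: {xavier_power_w:.1f} W")
print(f"Movidius efficiency: {movidius_eff:.1f} inferences/s per W")
```

Under these assumptions the Xavier figures imply roughly 16.4 W, and the Movidius stick works out to roughly 16.1 inferences/s per watt. If the peak throughput and peak efficiency were measured in different power modes, the implied power does not correspond to any single measured configuration.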
2021 International Conference on Graphics and Interaction (ICGI), 2021
With the ever-increasing demand for virtualizing every aspect of life, engineers design solutions to create immersion in virtual and augmented reality scenarios. Most virtual reality (VR) and augmented reality (AR) headsets require a connection to a computing system, so using them demands the acquisition of additional hardware. Following Moore's law, transistors have become smaller and computers more powerful. As a result, new hardware and programming languages have been developed to achieve high-performance graphics while requiring low power. This paper compares OpenGL and Vulkan implementations on Nvidia's Jetson development kits, equipped with edge GPUs that can generate 3D graphics under 5 W (Jetson Nano), 15 W (Jetson TX2), and 30 W (Jetson Xavier), providing an in-house processing and cost-effective headset solution without an external processing unit. We report that efficiency can be 2 times higher than that of desktop graphics processing units (GPUs) while maintaining a reasonable amount of rendering power.
Edge applications have evolved into a variety of scenarios that include the acquisition and compression of immense amounts of images in remote space environments such as satellites and drones, where characteristics such as power have to be properly balanced with constrained memory and parallel computational resources. CCSDS-123 is a standard for lossless compression of multispectral and hyperspectral images used on board satellites and military drones. This work explores the performance and power of three families of low-power heterogeneous Nvidia Jetson GPU architectures, namely the 128-core Nano, the 256-core TX2, and the 512-core Xavier AGX, by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared to the production of dedicated circuits while maintaining low power. This solution parallelizes the predictor on the low-power GPU while the entropy encoders exploit the heterogeneous multiple CPU cores and the GPU con...
2021 55th Asilomar Conference on Signals, Systems, and Computers, 2021
This paper introduces a novel concept for exploiting low-power edge graphics processing units (GPUs) for decoding higher-order non-binary low-density parity-check (LDPC) codes at a good performance level. In the proposed remote system, we exploit the asynchronous and simultaneous use of CPU and GPU resources, time-encoded data streams, and the concept of multi-codeword decoding. We report a coding gain greater than 1 dB compared to the binary counterpart for the optimal sum-product algorithm (SPA). We compare our proposed solution against dedicated application-specific integrated-circuit (ASIC) designs, showing that, although behind, the edge GPU is competitive in terms of performance and energy while requiring significantly less development effort. Moreover, the experiments confirm that the proposed edge architecture provides a promising framework for Galois fields of order up to 256 and for short to moderate code lengths equivalent to the binary (128, 64) and (512, 256) codes, supporting efficient and low-latency remote processing, reaching 2 Mbit/s, in conformity with the CCSDS-231 standard, under a global 7 W power budget.
2020 IEEE Workshop on Signal Processing Systems (SiPS), 2020
Signal processing hardware designers of low-density parity-check (LDPC) decoders used in modern optical communications are confronted with the need to perform multi-parametric design space exploration, targeting very high throughput (hundreds of Mbit/s) and low-power systems. This work addresses the needs of current designers of dedicated GF(2^m) NB-LDPC decoders, who need robust approaches for dealing with the ever-increasing demand for higher BER performance. These constraints put tremendous pressure on the on-chip design of irregular data structures and on the micro-circuit implementation that supports the complex Galois field mathematics and the communication of hundreds of check nodes with hundreds of variable node processors. We have developed kernels targeting GPU and FPGA (HLS and its equivalent RTL) descriptions of this class of complex circuits in order to compare area, frequency of operation, latency, parallelism, and throughput. Exploiting techniques such as custom bit-widths, pipelining, loop unrolling, array partitioning, and the replication of compute units results in considerably faster design cycles and demands less non-recurring engineering effort. We report a throughput of 800 Mbps for the FPGA case.
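To give a sense of the Galois field arithmetic that such decoders perform, the following is a minimal sketch of multiplication in GF(2^m) using shift-and-XOR with polynomial reduction. It is an illustration only, not the kernel described in the paper: the function name `gf_mul`, the default field size m = 8, and the reduction polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) are assumptions chosen for the example.

```python
def gf_mul(a: int, b: int, m: int = 8, poly: int = 0x11D) -> int:
    """Multiply two elements of GF(2^m), represented as integers below 2^m.

    `poly` is the reduction polynomial including its x^m term.
    The defaults (m = 8, poly = 0x11D) are illustrative assumptions.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << m):
            a ^= poly        # reduce modulo the field polynomial
    return result


# x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1 under 0x11D.
product = gf_mul(0x02, 0x80)
print(hex(product))
```

Hardware and GPU implementations commonly replace this loop with lookup tables or dedicated combinational logic; the loop form is shown here because it makes the reduction step explicit.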
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020
It is commonly perceived that an HLS specification targeting FPGAs cannot provide throughput on par with equivalent RTL descriptions. In this work we developed a complex design of a non-binary LDPC decoder which, although hard to generalise, shows that HLS provides sufficient architectural refinement options. These options allow the design to attain performance above that of CPU- and GPU-based implementations and to offer a faster design cycle than RTL development.
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
This work explores the use of low-power heterogeneous devices for parallelizing the compute-intensive entropy encoders of the CCSDS-123 hyperspectral and multispectral image compression standard. Multithreaded processing allows near-optimal exploitation of the system's bandwidth, increasing overall system performance. The experimental platform consists of a low-power Jetson TX2 GPU with ARM Cortex-A57 and Denver 2 host processors, reporting more than 1552 Mb/s and, more importantly, 315 Mb/s/W, all running under a global 5 W power budget, which makes it a good candidate for onboard image compression.
Integrated Master's dissertation in Electrical and Computer Engineering presented to the Faculty of Sciences and Technology. CCSDS 123 is a hyperspectral and multispectral image compression algorithm composed of a predictor and an encoder. The systems that generate these images (satellites, drones, etc.) usually have energy restrictions, so the algorithm is implemented mostly on FPGAs because of their low energy consumption. The smartphone market has turned CPUs and GPUs into energy-efficient devices, making them potential competitors to FPGAs in the field of low-energy compression. The objective of this dissertation is to parallelize CCSDS 123 on a low-power GPU (Jetson TX2). Intra-band prediction (P=0) uses a single kernel. With inter-band prediction (P>0), the predictor has data dependencies within bands, making parallelization less efficient and more challenging to implement. For the encoder, which contains data dependencies, hybrid parallelizations across multiple devices (CPU+GPU) are studied for the two encoders defined in the standard, producing a heterogeneous computing solution. The implementations are tested by comparing parallel and serial execution times in order to identify the best ones. An energy analysis is also performed by measuring the power drawn by the board over the algorithm's running time. Finally, throughput and energy efficiency are compared with the state of the art. Low-power GPUs bring a new paradigm to the field of multispectral and hyperspectral compression: although not as energy-efficient as FPGAs, they deliver high throughput rates.
The Consultative Committee for Space Data Systems (CCSDS) 123 is a standard for lossless compression of multispectral and hyperspectral images with applications in on-board power-constrained systems, such as satellites and military drones. This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits while maintaining low energy consumption. This solution parallelizes the predictor on a low-power graphics processing unit (GPU) while the encoders exploit the heterogeneous multiple cores of the CPUs and the GPU concurrently. We report more than 16.6 Gb/s for the predictor and 1.4 Gb/s for the whole system, requiring less than 6.3 W and providing an efficiency of 245.6 Mb/s/W.
Non-binary low-density parity-check (NB-LDPC) codes show higher error-correcting performance than binary low-density parity-check (LDPC) codes when the codeword length is moderate and/or the channel has bursts of errors. The need for high-speed decoders for future digital communications led to the investigation of optimized NB-LDPC decoding algorithms and efficient implementations that target high throughput and low energy consumption levels. We carried out a comprehensive survey of existing NB-LDPC decoding hardware that targets the optimization of these parameters. Even though existing NB-LDPC decoders are optimized with respect to computational complexity and memory requirements, they still lag behind their binary counterparts in terms of throughput, power and area optimization. This study contributes to an overall understanding of the state-of-the-art on application-specific integrated-circuit (ASIC), field-programmable gate array (FPGA)
Papers by Oscar Ferraz