Papers by Leandro Marzulo

Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), 2019
Optimization problems are of great importance to several industry sectors, from production planning to product distribution and transportation. Many problems of interest belong to the NP-Hard class, with no known algorithms to solve them exactly in polynomial time. Thus, heuristic strategies capable of escaping low-quality local optima (metaheuristics) are usually employed. Local search is generally the most computationally expensive step of a metaheuristic, so it is very important to make good use of the resources it consumes. This dissertation studies the use of multiple neighborhood strategies applied in parallel to explore a larger neighborhood space with better use of computational resources. The parallel processing of the neighborhood strategies is implemented at fine grain, through GPU processing, and at coarse grain, by…

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018
This paper proposes a dataflow implementation of a local search to solve the Minimum Latency Problem (MLP), a variant of the Traveling Salesman Problem (TSP). Since the problem is NP-Hard, the best results in the literature report the use of metaheuristic strategies, mainly based on the concept of variable neighborhoods. The dataflow architecture was proposed in the 1970s, with programs represented as dependency graphs, but the von Neumann architecture became the standard computing platform and dataflow has mostly been considered for theoretical experiments. Many state-of-the-art metaheuristics harness computational power from emerging heterogeneous computing platforms, such as Graphics Processing Units (GPUs), requiring some ideas of classic optimization algorithms to be rethought in order to properly exploit the architecture. We propose a hybrid dataflow architecture (simulated over CPU), where each node contains a GPU implementation that enumerates a neighborhood for the problem. The dataflow architecture uses a distributed network that provides scalability for solving large MLP instances, where each neighborhood exploration is part of a state-of-the-art Distributed Variable Neighborhood Descent (DVND). The whole scenario yields a heterogeneous multi-level parallelization approach that can be used to solve time-consuming problems, without being coupled to a specific instance or problem.

Parallel Computing, 2020

Electronic Notes in Discrete Mathematics, 2018
The Traveling Thief Problem (TTP) is a multi-component combinatorial optimization problem that combines two well-known problems from the literature: the Traveling Salesman Problem (TSP) and the Knapsack Problem (KP). This paper proposes a novel list-constrained local search process, inspired by Variable Neighborhood Descent (VND), for multiple neighborhood structures, combined with a Greedy Randomized Adaptive Search Procedure (GRASP) metaheuristic. The local search was implemented on a Graphics Processing Unit (GPU) architecture to harness its massive number of computing cores and evaluate neighbor solutions simultaneously, while the GRASP was implemented to exploit the natural parallelism of a multi-core CPU. The computational results were compared to state-of-the-art results from the literature and indicate promising research directions for the design of novel search algorithms on high-performance architectures.
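
The core VND idea described above can be sketched in a few lines. This is a minimal illustrative version, not the paper's list-constrained GPU implementation: it cycles through neighborhood structures, restarting from the first whenever one of them improves the incumbent solution, over a toy tour-cost instance.

```python
def vnd(solution, cost, neighborhoods):
    """Variable Neighborhood Descent: scan neighborhood k for the best
    neighbor; on improvement restart at k = 0, otherwise try k + 1."""
    k = 0
    while k < len(neighborhoods):
        best, best_cost = solution, cost(solution)
        for neighbor in neighborhoods[k](solution):
            c = cost(neighbor)
            if c < best_cost:
                best, best_cost = neighbor, c
        if best_cost < cost(solution):
            solution, k = best, 0   # improvement: restart from first neighborhood
        else:
            k += 1                  # local optimum for this neighborhood: next one
    return solution

# Toy instance: cities are points on a line, cost is the cyclic tour length.
def tour_cost(perm):
    return sum(abs(perm[i] - perm[i - 1]) for i in range(len(perm)))

def swap_neighbors(perm):
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            q = list(perm); q[i], q[j] = q[j], q[i]
            yield q

def insert_neighbors(perm):
    for i in range(len(perm)):
        for j in range(len(perm)):
            if i != j:
                q = list(perm); q.insert(j, q.pop(i))
                yield q

start = [3, 0, 4, 1, 2]
best = vnd(start, tour_cost, [swap_neighbors, insert_neighbors])
# best is a tour of cost 8, the optimum for points 0..4 on a line
```

The paper's contribution replaces the sequential neighborhood scans with GPU kernels that evaluate all neighbors of a structure at once.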

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
With the growing search for computational models in which the expression of parallelism occurs naturally, some paradigms arise as options for the next generation of computers. In this context, dynamic Dataflow and Gamma (General Abstract Model for Multiset mAnipulation) emerge as interesting computational model choices. In the dynamic Dataflow model, operations are performed as soon as their associated operands are available, without relying on a Program Counter to dictate the execution order of instructions. The Gamma paradigm is based on a parallel multiset rewriting scheme. It provides a non-deterministic execution model inspired by an abstract chemical machine metaphor, where operations are formulated as reactions that occur freely among matching elements belonging to the multiset. In this work, equivalence relations between the dynamic Dataflow and Gamma paradigms are exposed and explored, and methods to convert from the Dataflow to the Gamma paradigm and vice versa are provided. It is shown that vertices and edges of a dynamic Dataflow graph can correspond, respectively, to reactions and multiset elements in the Gamma paradigm. Implementation aspects of execution environments that could be mutually beneficial to both models are also discussed. This work provides the scientific community with the possibility of taking advantage of both parallel programming models, contributing a versatility component to researchers and developers. Finally, it is important to state that, to the best of our knowledge, the similarity relations between the dynamic Dataflow and Gamma models presented here have not been reported in any previous work.
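
The multiset-rewriting scheme can be illustrated with a tiny interpreter. This is a sequential sketch of the chemical metaphor, not an implementation from the paper: a reaction repeatedly consumes matching elements and emits products until the multiset is stable. The classic Gamma example of computing a maximum replaces any pair x, y with max(x, y).

```python
import random

def gamma(multiset, reaction):
    """Run a Gamma-style program: repeatedly pick a pair of elements that
    the reaction accepts, replace them with its products, and stop when
    no reaction can fire (the multiset is stable)."""
    ms = list(multiset)
    while True:
        random.shuffle(ms)              # model non-deterministic reactant choice
        fired = False
        for i in range(len(ms)):
            for j in range(len(ms)):
                if i != j:
                    products = reaction(ms[i], ms[j])
                    if products is not None:
                        hi, lo = max(i, j), min(i, j)
                        del ms[hi]; del ms[lo]  # consume reactants
                        ms.extend(products)     # emit products
                        fired = True
                        break
            if fired:
                break
        if not fired:
            return ms

# Reaction computing the maximum: x, y -> max(x, y) whenever x <= y.
max_reaction = lambda x, y: [y] if x <= y else None

result = gamma([5, 1, 9, 3, 9], max_reaction)
# the stable multiset contains only the maximum: [9]
```

In the correspondence the paper establishes, the reaction plays the role of a dataflow vertex and the multiset elements play the role of tagged operands on edges.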

2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Physically Unclonable Functions (PUFs) are hardware-based security primitives that promise to provide an advantage in terms of area and power compared to hardware implementations of standard cryptography algorithms. PUFs harness manufacturing process variations to realize binary keys (Weak PUFs) or binary functions (Strong PUFs). An ideal Strong PUF realizes a binary function that maps an m-bit input challenge to a random n-bit output response and offers an exponential number of such unique challenge-response pairs (CRPs). Hence, it is attractive for authentication applications. Unfortunately, most Strong PUF implementations are non-ideal, where an adversary can build a machine-learning model by observing relatively few CRPs, making it possible to predict the output response of a PUF to a future challenge. The existence of such a model, or clone, constitutes a breach of security. In this paper, we make two contributions: first, we demonstrate that by leveraging a Weightless Neural Network (WNN), we can realize a CMOS Strong PUF from a Weak PUF. Next, we demonstrate that WNN-based Strong PUFs offer robust resistance to machine learning, while also delivering on uniqueness and reliability metrics — bringing them closer to an ideal Strong PUF. Neural network hardware is gaining importance for pattern matching and classification. This work demonstrates how such a design may be re-purposed for security. In the rest of the paper, we present the architecture, practical implementation, and analysis of Neural Network based PUFs.

2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
As processor manufacturing companies shifted to chips with an ever-increasing number of cores, creating a tangible way for average programmers to exploit parallelism became imperative. The scientific community is on a quest to create programming models that make it easier to describe tasks and the interactions between them. On the other hand, as the number of cores increases, so does the chance of a fault occurring in a core, so it is also important to provide resiliency in these programming models. DFER was shown to be a good fit for taking advantage of dataflow programming while introducing resiliency to transient faults during dataflow task execution. However, although most of the computing time of a dataflow system is spent in task execution, it is also desirable to provide fault tolerance in scheduling operations. This paper introduces novel techniques that incorporate a level of resiliency into the dataflow task scheduler of DFER. Experiments with two different approaches for achieving resiliency in the scheduler show promising results that take DFER one step further towards reliability.

Proceedings of the 17th ACM International Conference on Computing Frontiers, 2020
Dynamic Information Flow Tracking has been successfully used to prevent a wide range of attacks and detect illegal access to sensitive information. Most proposed solutions only track the explicit information flow, where the taint is propagated through data dependencies. However, recent evasion attacks exploit implicit flows, which use control flow in the application to manipulate the data, thus making the malicious activity undetectable. We propose NIFT, a nested implicit flow tracking mechanism that extends explicit propagation to instructions affected by a control dependency. Our technique generates taint instructions at compile time, which are executed by specialized hardware to propagate taint implicitly even in cases of deeply nested branches. In addition, we propose a restricted taint propagation for data executed in conditional branches that affects only immediate instructions instead of all instructions inside the branch scope. Our technique efficiently locates implicit flows and resolves them with negligible performance overhead. Moreover, it mitigates the over-tainting problem.
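
The explicit-vs-implicit distinction above can be made concrete with a toy example. This sketch is illustrative only (it simulates taint bits in Python; NIFT does this in hardware): a variable assigned under a branch controlled by a secret copies the secret's value without any data dependency, so explicit-only tracking misses the leak while control-dependency tracking catches it.

```python
# Variables are (value, tainted?) pairs. Explicit tracking propagates taint
# only through data dependencies; implicit tracking additionally taints any
# variable written under a branch whose condition is tainted.

def run(secret_bit, implicit=True):
    env = {"secret": (secret_bit, True),   # the only taint source
           "x": (0, False)}
    cond_val, cond_taint = env["secret"]
    # Control-dependent assignment: x ends up equal to the secret, but no
    # data flows from 'secret' to 'x' -- this is the implicit flow.
    if cond_val:
        env["x"] = (1, cond_taint if implicit else False)
    else:
        env["x"] = (0, cond_taint if implicit else False)
    return env["x"]

leaked = run(1, implicit=False)   # (1, False): value leaked, taint lost
caught = run(1, implicit=True)    # (1, True): control dependency tainted x
```

Note that the branch-not-taken case must also taint `x` (as `run(0, implicit=True)` shows), which is why NIFT's restricted propagation for branch scopes matters for over-tainting.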

Anais do X Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2009), 2009
In the DataFlow model, instructions are executed as soon as their input operands are available, naturally exposing instruction-level parallelism (ILP). On the other hand, exploiting thread-level parallelism (TLP) has also become an important factor for increasing application performance on multicore machines. This work proposes a program execution model, based on DataFlow architectures, that turns ILP into TLP. The model is demonstrated through the implementation of a multi-threaded virtual machine, Trebuchet. The application is compiled to the DataFlow model and its independent instructions (according to the data flow) are executed on distinct Processing Elements (PEs) of Trebuchet. Each PE is mapped to a thread on the host machine. The model allows the definition of instruction blocks of different granularities, which are fired according to the data flow and executed directly on the host…
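
The firing rule described above — execute a node as soon as all its operands have arrived, mapping independent nodes onto threads — can be sketched with a minimal interpreter. This is a toy illustration of the ILP-to-TLP idea, not Trebuchet's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# A dataflow graph: node -> (operation, list of input nodes). A node fires
# when all its inputs are available; independent ready nodes run on
# distinct worker threads, turning instruction-level into thread-level
# parallelism.
graph = {
    "a":   (lambda: 2, []),
    "b":   (lambda: 3, []),
    "add": (lambda x, y: x + y, ["a", "b"]),   # waits for both a and b
    "sq":  (lambda x: x * x, ["add"]),
}

def execute(graph):
    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=4) as pool:
        while pending:
            # every node whose operands are all present fires in parallel
            ready = [n for n, (_, ins) in pending.items()
                     if all(i in results for i in ins)]
            futures = {n: pool.submit(pending[n][0],
                                      *[results[i] for i in pending[n][1]])
                       for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                del pending[n]
    return results

out = execute(graph)
# out["sq"] == 25: a and b fire concurrently, then add, then sq
```

In the first wave, `a` and `b` have no dependencies and fire on separate threads; `add` only fires once both operands exist, mirroring the data-driven firing rule.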

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2017
In the dataflow computation model, instructions or tasks are fired according to their data dependencies, instead of following program order, thus allowing natural parallelism exploitation. Dataflow has been used, in different flavors and abstraction levels (from processors to runtime libraries), as an interesting alternative for harnessing the potential of modern computing systems. Sucuri is a dataflow library for Python that allows users to specify their application as a dependency graph and execute it transparently on clusters of multicores, while taking care of scheduling issues. Recent trends in Fog and In-situ computing assume that storage and network devices will be equipped with processing elements that usually have lower power consumption and performance. An important decision in such systems is whether to move data to traditional processors (paying the communication costs) or to perform computation where the data is sitting, using a potentially slower processor. Hence, runtime…

IEEE Transactions on Emerging Topics in Computing, 2021
Dynamic dataflow scheduling enables effective exploitation of concurrency while making parallel programming easier. To this end, analyzing the inherent degree of concurrency available in dataflow graphs is an important task, since it may aid compilers or programmers in assessing the potential performance a program can achieve via parallel execution. However, traditional concurrency analysis techniques only work for DAGs (directed acyclic graphs), hence the need for new techniques that contemplate graphs with cycles. In this paper we present techniques to perform concurrency analysis on generic dynamic dataflow graphs, even in the presence of cycles. In a dataflow graph, nodes represent instructions and edges describe dependencies. The novelty of our approach is that we allow concurrency between different iterations of loops. Consequently, a set of concurrent nodes may contain instructions from different loops that can be proven independent. In this work, we provide a set of theoretical tools for obtaining bounds and illustrate the implementation of a parallel dataflow runtime on a set of representative graphs for important classes of benchmarks, comparing measured performance against the derived bounds.
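
For the acyclic case that traditional techniques cover, the standard concurrency bound is work over span: with unit-cost nodes, the critical path lower-bounds parallel time and work divided by span upper-bounds attainable speedup. A minimal sketch of that baseline computation (the paper's contribution, extending such bounds to cyclic graphs, is not reproduced here):

```python
from functools import lru_cache

# DAG as node -> list of successors; every node costs one time unit.
edges = {
    "a": ["c"], "b": ["c"],
    "c": ["d", "e"],
    "d": ["f"], "e": ["f"],
    "f": [],
}

def span(edges):
    """Length (in nodes) of the longest dependency chain in the DAG."""
    @lru_cache(maxsize=None)
    def depth(node):
        return 1 + max((depth(s) for s in edges[node]), default=0)
    return max(depth(n) for n in edges)

work = len(edges)                 # total unit-cost nodes: 6
critical_path = span(edges)       # longest chain a->c->d->f: 4
speedup_bound = work / critical_path   # no schedule can beat 1.5x here
```

Allowing concurrency across loop iterations, as the paper does, effectively shortens the critical path of the unrolled graph, raising this bound.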

Concurrency and Computation: Practice and Experience, 2021
With the growing search for computational models in which the expression of parallelism occurs naturally, some paradigms arise as options for the current generation of computers. In this context, dynamic dataflow and Gamma—General Abstract Model for Multiset mAnipulation—emerge as interesting computational model choices. In the dynamic dataflow model, operations are performed as soon as their associated operands are available, without relying on a Program Counter to dictate the execution order of instructions. The Gamma paradigm is based on a parallel multiset rewriting scheme. It provides a nondeterministic execution model inspired by an abstract chemical machine metaphor, where operations are formulated as reactions that occur freely among matching elements belonging to the multiset. In this work, equivalence relations between the dynamic dataflow and Gamma paradigms are exposed and explored, while methods to convert from dataflow to the Gamma paradigm and vice versa are provided. It is…

Journal of Hardware and Systems Security, 2019
Physically unclonable functions (PUFs) have been explored as lightweight hardware primitives for the purpose of realizing robust security via strong authentication or secure key/ID generation. PUFs harness manufacturing process variations for the purpose of generating binary keys or binary functions. An ideal strong PUF is a binary function that maps an m-bit input challenge to a unique n-bit output response, making it attractive for authentication applications. Unfortunately, real strong PUF implementations suffer from reliability issues where the same challenge may produce different responses in the presence of noise. To overcome this problem, strong PUFs leverage the availability of an exponential number of challenge-response pairs (CRPs). A successful authentication event requires acquiring multiple CRPs and applying a threshold. In contrast, weak PUFs produce limited keys and are required to be highly reliable. Multiple techniques have been developed to achieve the necessary reliability. An additional prerequisite for strong PUFs is resilience against model-building attacks (cloning) by an adversary who has observed a few CRPs, to prevent successful prediction of future CRPs. In this work, we first illustrate a strong PUF design that re-purposes a weightless neural network (WNN). Second, we showcase the robustness of WNN-based strong PUFs with respect to machine learning attacks, while providing desirable uniqueness and reliability metrics. Finally, we employ an initial entropy source of highly reliable weak PUF bits mapped to weightless neural networks (WNNs) for the purpose of creating a near-ideal strong PUF in terms of reliability. Our results show that it is possible to create highly reliable WNN-based strong PUFs with <65% ML accuracy by using as few as 32 initial reliable weak PUF bits.
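
The threshold-based authentication step mentioned above can be sketched as follows. This is an illustrative toy (not the paper's WNN design; challenge names, response widths, and the threshold are made up): the verifier replays enrolled challenges and accepts the device if each noisy response stays within a Hamming-distance threshold of the enrolled reference.

```python
# Toy strong-PUF authentication: compare noisy responses against enrolled
# references using a Hamming-distance threshold.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def authenticate(enrolled, respond, threshold=2):
    """enrolled: {challenge: reference response bits};
    respond: the device's (noisy) response function."""
    return all(hamming(ref, respond(ch)) <= threshold
               for ch, ref in enrolled.items())

# Hypothetical enrolled CRPs for one device (8-bit responses).
crp = {"c1": [0, 1, 1, 0, 1, 0, 0, 1],
       "c2": [1, 1, 0, 0, 0, 1, 1, 0]}

# Genuine device: one flipped bit of noise on c1 -> still within threshold.
noisy = lambda ch: {"c1": [0, 1, 1, 0, 1, 0, 0, 0],
                    "c2": [1, 1, 0, 0, 0, 1, 1, 0]}[ch]
# Naive clone with no knowledge of the device -> rejected.
clone = lambda ch: [0] * 8

genuine_ok = authenticate(crp, noisy)   # True
clone_ok = authenticate(crp, clone)     # False
```

The security argument in the paper is that an ML model of the PUF must not be able to play the role of `respond` well enough to pass this check.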
International Journal of Grid and Utility Computing, 2019

Concurrency and Computation: Practice and Experience, 2018
In a program, there is usually a significant number of instructions that are repeatedly executed with the same inputs. This redundancy allows the reuse of previous computations, potentially reducing program execution time. The Dynamic Trace Memoization (DTM) technique was proposed to exploit the reuse of dynamic sequences of redundant instructions on superscalar CPUs. This paper proposes the application of the DTM technique to a GPU architecture. We propose the DTM@GPU model, which adapts the original DTM technique to the NVIDIA GPU architecture by introducing architectural modifications and identifying different trace reuse styles in multithreaded environments. We investigate reuse opportunities in real-world GPU applications and the potential performance gains. We also perform a detailed investigation of the characteristics of the reused traces. This characterization shows the number and size of the reused traces, the influence of the cach…

Concurrency and Computation: Practice and Experience, 2018
Instruction Reuse is a technique adopted in von Neumann architectures that improves performance by avoiding redundant execution of instructions when the result to be produced can be obtained by searching an input/output memoization table for such an instruction. Trace reuse can be applied to traces of instructions in a similar fashion. However, those techniques are yet to be studied in the context of the Dataflow model, which has been gaining traction in the high-performance computing community due to its inherent parallelism. Dataflow programs are represented by directed graphs where nodes are instructions or tasks and edges denote data dependencies between tasks. This work presents Dataflow Dynamic Task Memoization (DF-DTM), a technique that allows the reuse of both nodes and subgraphs in dataflow, which are analogous to instructions and traces, respectively. The potential of DF-DTM is evaluated by a series of experiments that analyze the behavior of redundant tasks in five re…
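
The node-reuse half of the idea reduces to a memoization table consulted before firing a task. This is a hypothetical sketch of that mechanism (the class and method names are invented, not DF-DTM's API): on a table hit for the same (task, operands) pair, the runtime returns the cached result and skips execution.

```python
# Task-level memoization sketch: before firing a node, probe a table keyed
# by (task name, input operands); on a hit, reuse the prior result.

class MemoizingRuntime:
    def __init__(self):
        self.table = {}     # (task name, operands) -> cached result
        self.hits = 0
        self.misses = 0

    def fire(self, name, task, *operands):
        key = (name, operands)
        if key in self.table:
            self.hits += 1              # redundant task: execution skipped
            return self.table[key]
        self.misses += 1
        result = task(*operands)        # genuinely new inputs: execute
        self.table[key] = result
        return result

rt = MemoizingRuntime()
square = lambda x: x * x
inputs = [3, 4, 3, 3, 4]                # repeated operands => redundant tasks
outputs = [rt.fire("square", square, v) for v in inputs]
# outputs == [9, 16, 9, 9, 16]; only 2 of the 5 firings actually execute
```

Subgraph (trace) reuse generalizes this by keying whole chains of nodes on the inputs of the chain, analogous to DTM traces.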

2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2016
Sucuri is a minimalistic Python library that provides dataflow programming through a reasonably simple syntax. It allows transparent execution on computer clusters and natural exploitation of parallelism. In Sucuri, programmers instantiate a dataflow graph, where each node is assigned to a function and edges represent data dependencies between nodes. The original implementation of Sucuri adopts a centralized scheduler, which incurs high communication overheads, especially in clusters with a large number of machines. In this paper we modify Sucuri so that each machine in a cluster has its own scheduler. Before execution, the dataflow graph is partitioned so that nodes can be distributed among the machines of the cluster. At runtime, idle workers grab tasks from a ready queue in their local scheduler. Experimental results confirm that the solution can reduce communication overheads, improving performance in larger clusters.
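
The local-scheduler idea — each machine owns a ready queue that its idle workers pull from, with no round-trips to a central scheduler — can be sketched for a single machine. This is assumed behavior for illustration, not Sucuri's actual API:

```python
import queue
import threading

def run_machine(local_tasks, results, n_workers=2):
    """One 'machine' after graph partitioning: its tasks go into a local
    ready queue, and idle worker threads grab work from that queue."""
    ready = queue.Queue()
    for t in local_tasks:
        ready.put(t)

    def worker():
        while True:
            try:
                fn, arg = ready.get_nowait()   # idle worker grabs a task
            except queue.Empty:
                return                          # local queue drained
            results.append(fn(arg))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

results = []
# Suppose partitioning assigned this machine four independent tasks.
run_machine([(lambda x: x + 1, i) for i in range(4)], results)
# results contains 1, 2, 3, 4 in some worker-dependent order
```

In the distributed version, completed results that feed nodes on other machines are forwarded over the network instead of being appended locally, which is where the communication savings over a centralized scheduler come from.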

IET Circuits, Devices & Systems, 2017
Dynamic dataflow allows simultaneous execution of instructions in different iterations of a loop, boosting parallelism exploitation. In this model, operands are tagged with their associated instance number, which is incremented as they go through the loop. Instruction execution is triggered when all input operands with the same tag become available. However, this traditional tagging mechanism often requires the generation of several control instructions to manipulate tags and guarantee correct matching. To address this problem, this work presents three dataflow loop optimisation techniques. Stack-tagged dataflow is a tagging mechanism that uses stacks of tags to reduce control overheads in dataflow. On the other hand, as nested loops may increase the overhead of stack-tag comparison, tag resetting can be used to set the tag to zero whenever it is safe, allowing a one-level reduction in the stack depth. Finally, loop skipping allows further avoiding stack comparison overhead in loops when the number of iterations can be determined by the compiler. Experimental results show the overheads, drawbacks, and benefits of the three optimisations presented. Moreover, the results suggest that a hybrid compiling approach can be used to get the best performance out of each technique.
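
The stack-tagging scheme can be illustrated with a small helper. This is a sketch of the assumed semantics, not the paper's hardware mechanism: entering a loop pushes a fresh iteration counter, each iteration bumps the top of the stack, and leaving the loop pops it, so nested loops naturally nest their tags.

```python
class StackTag:
    """Operand tag maintained as a stack of iteration counters, one level
    per enclosing loop; operands match when their full tag tuples match."""
    def __init__(self):
        self.stack = []

    def enter_loop(self):
        self.stack.append(0)        # new loop level starts at iteration 0

    def next_iteration(self):
        self.stack[-1] += 1         # only the innermost counter advances

    def exit_loop(self):
        self.stack.pop()            # discard the finished loop's counter

    def tag(self):
        return tuple(self.stack)    # the value operands are matched on

t = StackTag()
t.enter_loop()          # outer loop
t.next_iteration()      # outer iteration 1
t.enter_loop()          # inner loop, nested inside outer iteration 1
t.next_iteration()
inner_tag = t.tag()     # (1, 1)
t.exit_loop()
outer_tag = t.tag()     # (1,)
```

Tag resetting corresponds to zeroing a level when it is provably safe, so comparisons can ignore that level; loop skipping avoids the comparison altogether when the trip count is known at compile time.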

2015 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2015
Linear Algebra Kernels play an important role in many petroleum reservoir simulators extensively used by the industry. The growth in problem size, especially in pre-salt exploration, has increased the execution time of those kernels, thus requiring parallel programming to improve performance and make the simulation viable. On the other hand, exploiting parallelism in systems with an ever-increasing number of cores can be an arduous task, as the programmer has to manage threads and handle synchronization issues. Current work on parallel programming models shows that Dataflow execution exploits parallelism in a natural way, allowing the programmer to focus solely on describing dependencies between portions of code. This work consists of implementing parallel Linear Algebra Kernels using the Dataflow model. The Trebuchet Dataflow Virtual Machine and the Sucuri Dataflow Library were used to evaluate the solutions with real inputs from reservoir simulators. Results were compared with OpenMP and the Intel Math Kernel Library and show that coarser-grained tasks are needed to hide the overheads of dataflow runtime environments. Therefore, level-2 and level-3 linear algebra operations, such as vector-matrix and matrix-matrix products, presented the most promising results.