Papers by Pierfrancesco Foglia
A solution adopted in the past to design high-performance multiprocessor systems that scale with the number of CPUs was the Distributed Shared Memory (DSM) multiprocessor with coherent caches, whose coherence was maintained by a directory-based coherence protocol. Such a solution delivers high performance even with large numbers of processors (512 or more). Modern systems can place two or more processors on the same die (Chip Multiprocessors, CMP), each with its private caches, while the last-level caches can be either private or shared. As these systems are affected by the wire delay problem, NUCA caches have been proposed to hide the effects of such delay and increase performance. A CMP system that adopts a NUCA as its shared last-level cache must be able to maintain coherence among the lowest, private levels of the cache hierarchy. As future generation systems are expected to have more than 500 cores per chip, one way to guarantee a high level of scalability is to adopt a directory coherence protocol, similar to the ones that characterized DSM systems. Previous works focusing on NUCA-based CMP systems adopt a fixed topology (i.e., the physical position of cores and NUCA banks, and the communication infrastructure) and either MESI or MOESI as the coherence protocol, without motivating these choices. In this paper, we present an evaluation of an 8-CPU CMP system with two levels of cache, in which the L1s are private to each core, while the L2 is a Static-NUCA shared among all cores. We considered three different system topologies (the first with the eight CPUs connected to the NUCA on the same side, the second with half of the CPUs on one side and the others on the opposite side, the third with two CPUs on each side), and for all the topologies we considered MESI and MOESI.
Our preliminary results show that processor topology has a greater effect on performance and NoC bandwidth occupancy than the coherence protocol.
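For readers unfamiliar with the protocols compared above, the following is a minimal sketch of the textbook MESI state transitions for a single cache line. It is not the directory protocol evaluated in the paper; the function name `next_state`, the event names, and the `others_have_copy` flag are hypothetical and chosen only for illustration.

```python
def next_state(state, event, others_have_copy=False):
    """Return the next textbook MESI state of one cache line.

    state: one of "M", "E", "S", "I".
    event: "local_read", "local_write", "remote_read", "remote_write"
           (hypothetical event names used only in this sketch).
    others_have_copy: whether another cache holds the line (only matters
           for a local read that misses).
    """
    if state not in ("M", "E", "S", "I"):
        raise ValueError(f"unknown state: {state}")
    if event == "local_read":
        if state == "I":
            # A read miss loads the line Shared if someone else has it,
            # Exclusive otherwise.
            return "S" if others_have_copy else "E"
        return state  # read hits never change the state
    if event == "local_write":
        # Any local write ends in Modified (other copies are invalidated).
        return "M"
    if event == "remote_read":
        # A valid copy is downgraded to Shared (a Modified line is
        # written back first); an Invalid line stays Invalid.
        return "I" if state == "I" else "S"
    if event == "remote_write":
        # Another cache takes ownership, so the local copy is invalidated.
        return "I"
    raise ValueError(f"unknown event: {event}")
```

MOESI differs mainly in adding an Owned state, so that a Modified line read by another cache can be shared without an immediate write-back to memory.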
Multi-objective optimization of water distribution networks. Two multi-objective approaches to the consideration of pipe breakage data in water distribution network designs are formulated. Both models are based on the constraint method for multi-objective analysis. One model analyses the relationship between initial capital cost and subsequent repair and maintenance costs. Pipe breakage data is used to restrict the repair costs permitted in the system. The other model examines the relationships between initial pipe costs and the reliability of the pipes within the distribution network. In this second model, both the worst-case and average system performance are examined in relation to the cost, making the model a three-objective approach. The pipe breakage data is used to restrict the expected number of failures allowed in any link. The actual number of expected breaks occurring in each link is then used to develop Poisson-based probabilities of node isolation. Application of the two approaches shows that the information obtained from such multi-objective approaches gives improved insight into the nature of the trade-offs between initial cost and repair cost, and between initial cost and system reliability.
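The abstract does not spell out how the Poisson-based isolation probabilities are computed. The sketch below shows one plausible reading, under two assumptions that are not stated in the source: breaks on a link follow a Poisson distribution with the expected number of breaks as its mean, and a node is isolated only when every link attached to it has at least one break, with links failing independently. The function names are hypothetical.

```python
import math


def link_failure_probability(expected_breaks):
    """P(at least one break) for a Poisson count with the given mean."""
    if expected_breaks < 0:
        raise ValueError("expected_breaks must be non-negative")
    return 1.0 - math.exp(-expected_breaks)


def node_isolation_probability(expected_breaks_per_link):
    """Probability that all links attached to a node fail.

    Assumes independent link failures (an assumption of this sketch).
    """
    probability = 1.0
    for expected_breaks in expected_breaks_per_link:
        probability *= link_failure_probability(expected_breaks)
    return probability


if __name__ == "__main__":
    # Hypothetical node served by three links with illustrative break rates.
    print(node_isolation_probability([0.2, 0.1, 0.05]))
```

Under these assumptions, adding a link to a node can only lower its isolation probability, which is one way the trade-off between initial cost and reliability shows up.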
MEDEA is a half-day workshop intended as a forum for people from academia and industry to exchange ideas and experience on memory architectures for general-purpose, commercial and embedded systems. Main topics are memory architecture and memory-related performance/power issues, as well as memory management and optimization themes, considering system architecture and application domain jointly. The program presents works on memory organization, performance and power in various kinds of systems (e.g. vector and heterogeneous CMP), and works on memory management on CMP architectures.
In this work, we analyze how a DSS (Decision Support System) workload can be accelerated on a shared-bus shared-memory multiprocessor by adding simple support to the classical MESI coherence protocol. The DSS workload has been set up using the TPC-D benchmark on the PostgreSQL DBMS. The analysis has been performed via trace-driven simulation, and operating system effects are also considered in our evaluation. We analyzed a basic four-processor machine and a high-end sixteen-processor machine, implementing MESI and two coherence protocols that deal with migration of processes and data: PSCR and AMSD. Results show that, even in the four-processor case, for a DSS workload the use of a write-update protocol with a selective invalidation strategy for private data improves performance (and scalability) with respect to a classical MESI-based solution, because of the access pattern to shared data and the lower bus utilization due to the absence of invalidation misses when we eliminate the contribution of passive sharing. In the 16-processor case, and especially in situations where the scheduler cannot apply the affinity requirements, the gain becomes more important: the advantage of a write-update protocol with a selective invalidation strategy for private data, in terms of execution time, can be quantified at about 20% relative to the other evaluated protocols. This advantage is about 50% in the case of high cache-to-cache transfer latency.
Computer architecture news, Sep 17, 2005
Because of the increasing complexity of embedded systems, the related design process is becoming more and more complex and time-consuming. In this setting, the use of standard tools and methodologies could significantly help designers reduce time to market. In this paper we present our experience in design space exploration for devices based on H.264 video coders. Despite the inevitable inaccuracies due to the adoption of a system-level approach, the overall methodology has proven suitable for identifying the most convenient architectural solutions by means of fast, high-level simulation.
Lecture Notes in Computer Science, 2010
Mining Intelligence and Knowledge Exploration, 2018
Pattern recognition in financial time series is not a trivial task, due to the level of noise, the volatile context, the lack of formal definitions and the high number of pattern variants. A current research trend involves machine learning techniques and online computing. However, medium-term trading is still based on human-centric heuristics, and the integration with machine learning support remains relatively unexplored. The purpose of this study is to investigate the potential and perspectives of a novel architectural topology providing modularity, scalability and personalization capabilities. The proposed architecture is based on the concept of Receptive Fields (RF), i.e., sub-modules focusing on specific patterns, which can be connected to further levels of processing to analyze the price dynamics at different granularities and different abstraction levels. Both Multilayer Perceptrons (MLP) and Support Vector Machines (SVM) have been tested as RFs. Early experiments have been carried out on the FTSE-MIB index.
D-NUCA L2 caches are able to tolerate the increasing wire delay effects due to technology scaling thanks to their banked organization, broadcast line search and data promotion/demotion mechanism. The data promotion mechanism aims at moving frequently accessed data near the core, but causes additional accesses to cache banks, hence increasing dynamic energy consumption. It is shown how, in some cases, this migration mechanism is not successful in reducing data access latency and can be selectively and dynamically inhibited, thus reducing dynamic energy consumption without affecting performance.
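The promotion mechanism and its inhibition can be illustrated with a toy model. This is only a sketch under simplifying assumptions that do not come from the paper: a bank set is a list with one line per bank, index 0 is the bank nearest the core, a promotion swaps the hit line with the line in the next closer bank, and each swap is charged two extra bank accesses. The function name `access` and the `promote` flag are hypothetical.

```python
def access(bankset, address, promote=True):
    """Look up `address` in a toy D-NUCA bank set.

    bankset: list of line addresses, index 0 being the bank closest to
             the core (one line per bank, a simplification).
    promote: when False, the promotion is inhibited.
    Returns (hit_bank_index, extra_bank_accesses); (None, 0) on a miss.
    """
    if address not in bankset:
        return None, 0
    hit_bank = bankset.index(address)
    if promote and hit_bank > 0:
        # Swap with the next closer bank: the hit line is promoted and
        # the line it displaces is demoted. Charged as two extra accesses.
        closer = hit_bank - 1
        bankset[closer], bankset[hit_bank] = bankset[hit_bank], bankset[closer]
        return hit_bank, 2
    return hit_bank, 0


if __name__ == "__main__":
    banks = ["a", "b", "c"]
    print(access(banks, "c"), banks)                 # promotion enabled
    print(access(banks, "c", promote=False), banks)  # promotion inhibited
```

In this model, inhibiting promotion saves the extra bank accesses of the swap, at the price of leaving the line farther from the core for later accesses.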
Encyclopedia of Parallel Computing, 2011
NAMD (NAnoscale Molecular Dynamics)
[Figure: Flow diagram showing the computation of forces in NAMD every step (bonded computes, non-bonded computes)]
Related Entries: Anton, a Special-Purpose Molecular Simulation Machine; Charm++; N-Body Computational Methods
Bibliographic Notes and Further Reading: A good survey of parallelization techniques for MD programs can be found in Plimpton et al. []. Snir discusses the communication requirements of force and spatial decomposition and proposes a hybrid algorithm similar to that of NAMD []. Other MD packages similar to NAMD are CHARMM, AMBER, GROMACS, Blue Matter, and Desmond. The paper by Kale et al. [] was awarded the Gordon Bell award at Supercomputing . Detailed performance benchmarking of NAMD and recent algorithmic changes to enable scaling to large machines can be found in [] and [].
Encyclopedia of Parallel Computing, 2011