2006, Journal of Database Management
Data Warehouses (DWs) with large quantities of data present major performance and scalability challenges, and parallelism can deliver major performance improvements in this context. However, instead of costly specialized parallel hardware and interconnects, we focus on low-cost standard computing nodes, possibly in a non-dedicated local network. In this environment, special care must be taken with partitioning and processing. We use experimental evidence to analyze the shortcomings of a basic horizontal partitioning strategy designed for that environment, then propose and test improvements to allow …
Proceedings of the 7th ACM international workshop on Data warehousing and OLAP - DOLAP '04, 2004
Parallelism can be used for major performance improvements in large data warehouses (DWs) facing performance and scalability challenges. A simple low-cost shared-nothing architecture with horizontally fully-partitioned facts can significantly speed up the response time of the data warehouse. However, extra overheads related to processing large replicated relations and to repartitioning data between nodes can significantly degrade speedup for many query patterns if special care is not taken during placement to minimize them. In this paper we demonstrate these problems experimentally with the help of the TPC-H performance benchmark and identify simple modifications that minimize such undesirable overheads. We analyze experimentally a simple, easy-to-apply partitioning and placement decision that achieves good performance improvements.
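To make the placement concrete, here is a minimal sketch of the strategy both abstracts above describe, assuming hypothetical helper names (node_for, place) and a TPC-H-style orderkey partitioning key: hash-partition the fact table across nodes and replicate the small dimension tables everywhere. Joins on the partitioning key then run node-locally, while joins on any other key incur exactly the repartitioning overhead the paper measures.

```python
# Minimal sketch of the placement described above (hypothetical helper
# names, not the authors' code): hash-partition the fact table across
# nodes and give every node a full copy of the small dimension tables.

NUM_NODES = 4

def node_for(row, key):
    # Route a fact row to a node by hashing its partitioning key.
    return hash(row[key]) % NUM_NODES

def place(fact_rows, dimension_tables):
    # Facts are partitioned; dimensions are replicated on every node.
    nodes = [{"facts": [], "dims": dimension_tables} for _ in range(NUM_NODES)]
    for row in fact_rows:
        nodes[node_for(row, "orderkey")]["facts"].append(row)
    return nodes

# Example: fact rows spread across nodes, dimensions copied everywhere.
facts = [{"orderkey": 1, "qty": 5}, {"orderkey": 2, "qty": 7}]
dims = {"nation": [{"nationkey": 0, "name": "ALGERIA"}]}
placement = place(facts, dims)
```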
Proceedings of the 2008 International Symposium on Parallel and Distributed Processing with Applications, ISPA 2008, 2008
Much has been said about processing data efficiently in parallel database servers, and some data warehouse applications must process on the order of tens to hundreds of gigabytes efficiently. Yet there is no effective approach targeted at using non-dedicated, low-cost platforms efficiently in this context. Imagine taking 10 or 1000 commodity PCs and setting up a data-crunching platform for large database-resident data with acceptable performance. There are significant, inter-related data-layout and processing challenges when the computational, storage and network hardware are heterogeneous and slow. We propose how to place, replicate and load-balance the data efficiently in this context. This work innovates in several respects: it is practically as fast as full mirroring without its overhead, and it exploits schema knowledge, chunk-wise placement, replication and load-balanced processing to be faster and more flexible than previous efforts. Our findings are complemented by an evaluation using TPC-H performance benchmark queries.
Distributed and Parallel Databases, 2009
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a long time to process on a single PC. Consider instead partitioning the data across a set of PCs (nodes), with either a parallel database server or any database server at each node plus an engine-independent middleware. Nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, and we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index measuring relative node performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on de-clustering and replication and are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
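A hedged sketch of how the three contributions could fit together (the names perf_index and replication_degree and the greedy scheduler are our assumptions, not the paper's implementation): each chunk is stored on replication_degree nodes, and at query time chunks are handed out on demand so that nodes with a higher performance index process proportionally more of them.

```python
# Illustrative sketch of chunk-wise, replicated, load-balanced processing
# over heterogeneous nodes; all names are assumptions, not the paper's code.

import random

def replicate(chunks, nodes, replication_degree=2):
    # Store each chunk on `replication_degree` distinct nodes.
    return {c: random.sample(nodes, replication_degree) for c in chunks}

def schedule(chunks, placement, perf_index):
    # Greedy on-demand scheduling: each chunk goes to whichever of its
    # holders currently has the least work relative to its performance,
    # so faster nodes naturally end up with more chunks.
    assigned = {n: [] for n in perf_index}
    for c in chunks:
        owner = min(placement[c],
                    key=lambda n: len(assigned[n]) / perf_index[n])
        assigned[owner].append(c)
    return assigned

nodes = ["n1", "n2", "n3"]
perf = {"n1": 1.0, "n2": 2.0, "n3": 0.5}  # n2 is twice as fast as n1
plan = schedule(range(12), replicate(range(12), nodes), perf)
```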
Since a relational data warehouse has the same physical structure as a classical database, it can enjoy all the benefits realized in the past by distributed databases, such as data availability, simplicity, rapid local data access and transparent access to remote sites. It is therefore interesting to test its decentralization by adopting the most adequate fragmentation and allocation techniques. In this paper, we present a data warehouse fragmentation and allocation approach in a distributed context. We first conduct computational studies using a mathematical cost model, then test our approach on a real data warehouse using the APB-1 benchmark data set on Oracle 11g.
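The abstract does not spell the cost model out; a generic form of the kind such fragmentation-and-allocation studies typically minimize (an assumption on our part, not the paper's exact model) is:

```latex
% Generic allocation cost model (our assumption, not the paper's):
% local page I/O plus network transfer, summed over the workload Q.
\[
  Cost(Q) \;=\; \sum_{q \in Q} f_q
    \left( \sum_{F \in frag(q)} \frac{\lVert F \rVert}{PS}
      \;+\; c_{net} \sum_{F \in remote(q)} t(F, q) \right)
\]
```

Here $f_q$ is the frequency of query $q$, $frag(q)$ the set of fragments it touches, $\lVert F \rVert / PS$ the pages read for fragment $F$, $remote(q)$ the fragments not stored at the site evaluating $q$, $t(F, q)$ the tuples shipped to that site, and $c_{net}$ the per-tuple transfer cost; an allocation is then chosen to minimize $Cost(Q)$.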
Lecture Notes in Computer Science, 2010
In this paper we propose a comprehensive methodology for designing Parallel Relational Data Warehouses (PRDW) over database clusters, called Fragmentation & Allocation (F&A). F&A assumes that cluster nodes are heterogeneous in processing power and storage capacity, contrary to traditional design approaches that assume homogeneous nodes, and it performs the fragmentation and allocation phases simultaneously, whereas traditional approaches perform them in isolation. We also propose a naive replication algorithm that takes into account the heterogeneous characteristics of our reference architecture. Finally, our proposal is experimentally assessed and validated against the widely-known data warehouse benchmark APB-1 release II.
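As an illustration of heterogeneity-aware allocation (a sketch under our own assumptions, not the paper's F&A algorithm), fragments can be assigned largest-first to the storage-feasible node that is furthest below its power-proportional share of the data:

```python
# Illustrative heterogeneity-aware allocation sketch; all names are
# assumptions. fragments: {name: size}; nodes: {name: (power, capacity)}.

def allocate(fragments, nodes):
    total_power = sum(p for p, _ in nodes.values())
    used = {n: 0 for n in nodes}
    plan = {n: [] for n in nodes}
    for frag, size in sorted(fragments.items(), key=lambda kv: -kv[1]):
        # Candidates have enough free storage; pick the one furthest
        # below its power-proportional share of the data placed so far.
        candidates = [n for n in nodes if used[n] + size <= nodes[n][1]]
        if not candidates:
            raise ValueError(f"no node can store fragment {frag}")
        total_size = sum(used.values()) + size
        best = min(candidates,
                   key=lambda n: used[n] / total_size
                                 - nodes[n][0] / total_power)
        plan[best].append(frag)
        used[best] += size
    return plan

# Example: a node with twice the power receives the larger fragments.
plan = allocate({"F1": 40, "F2": 30, "F3": 20, "F4": 10},
                {"fast": (2.0, 100), "slow": (1.0, 60)})
```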
International Journal of Data Warehousing and Mining, 2009
Data warehouses are a crucial technology for competitive organizations in a globalized world. Size, speed and distributed operation are the major challenges these systems face. Many data warehouses are huge yet must process queries quickly and efficiently, so parallel solutions are deployed to deliver the necessary performance. Distributed operation, on the other hand, concerns global commercial and scientific organizations that need to share their data in a coherent distributed data warehouse. In this paper we review the major concepts, systems and research results behind parallel and distributed data warehouses.
Very Large Data Bases, 2005
Grid computing has the potential to drastically change enterprise computing as we know it today. The main concept of Grid computing is to treat computing as a utility: it should not matter where data resides or which computer processes a task. This concept has been applied successfully to academic research, and it also has many advantages for commercial data warehouse applications, such as virtualization, flexible provisioning, reduced cost due to commodity hardware, high availability and high scale-out. In this paper we show how a large-scale, high-performing and scalable Grid-based data warehouse can be implemented using commodity hardware (industry-standard x86-based servers), Oracle Database 10g and the Linux operating system. We further demonstrate this architecture in a recently published TPC-H benchmark.
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and database schema. A common technique to reduce network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in this respect, since only subsets of tables in complex schemata sharing the same join key can be co-partitioned, unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partitioning (PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automated partitioning design algorithms that maximize data locality. One algorithm needs only the schema and data, whereas the other additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
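A toy rendering of the PREF idea (our reading of the abstract, not the paper's implementation): a seed table is hash-partitioned, and a referencing table is then partitioned by placing each row wherever its join partners live, duplicating a row when its partners span several partitions.

```python
# Toy sketch of predicate-based reference partitioning; helper names
# and the brute-force partner lookup are illustrative assumptions.

NUM_PARTS = 3

def hash_partition(rows, key):
    # Seed partitioning: hash rows of the referenced table by key.
    parts = [[] for _ in range(NUM_PARTS)]
    for r in rows:
        parts[hash(r[key]) % NUM_PARTS].append(r)
    return parts

def ref_partition(rows, seed_parts, predicate):
    # Place each row into every partition holding a join partner;
    # a row is duplicated if its partners span several partitions.
    parts = [[] for _ in range(NUM_PARTS)]
    for r in rows:
        for i, seed in enumerate(seed_parts):
            if any(predicate(r, s) for s in seed):
                parts[i].append(r)
    return parts

orders = [{"o_orderkey": k} for k in range(6)]
lineitem = [{"l_orderkey": k % 6, "l_qty": k} for k in range(12)]
o_parts = hash_partition(orders, "o_orderkey")
l_parts = ref_partition(lineitem, o_parts,
                        lambda l, o: l["l_orderkey"] == o["o_orderkey"])
# An ORDERS-LINEITEM join on orderkey is now partition-local.
```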
2011
In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.
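The advisor's core search can be pictured as follows (a simplified sketch; in the actual tool the estimate_cost stand-in is the deeply integrated parallel query optimizer, and the enumeration is far smarter than brute force):

```python
# Simplified partitioning-advisor search loop; illustrative only.
# estimate_cost stands in for the parallel optimizer's cost model.

from itertools import product

def advise(tables, candidate_columns, workload, estimate_cost):
    # Each table is either REPLICATE or DISTRIBUTE on a candidate column;
    # return the design minimizing total estimated workload cost.
    options = {t: ["REPLICATE"] + [("DISTRIBUTE", c)
                                   for c in candidate_columns.get(t, [])]
               for t in tables}
    best_design, best_cost = None, float("inf")
    for combo in product(*options.values()):
        design = dict(zip(options.keys(), combo))
        cost = sum(estimate_cost(design, q) for q in workload)
        if cost < best_cost:
            best_design, best_cost = design, cost
    return best_design, best_cost

# Example with a dummy cost function that favors distributing the big
# table and replicating the small one.
design, cost = advise(
    ["sales", "region"], {"sales": ["cust_id"], "region": []},
    workload=["q1"],
    estimate_cost=lambda d, q: (5 if d["sales"] == "REPLICATE" else 1)
                               + (1 if d["region"] == "REPLICATE" else 3),
)
```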