Update-intensive main memory database applications generate a huge volume of log records; to maintain the ACID properties of the database system, these records must be persisted efficiently. We propose delegating the logging of one main memory database to another main memory database. The scheme is elaborated in detail in terms of architecture, logging and safeness levels, checkpointing, and recovery. Both strict durability and relaxed durability are provided. When some form of non-volatile memory is used to temporarily hold log records, logging efficiency is improved and the scheme can guarantee the full ACID properties of the system. We also propose parallel logging, which speeds up log persistence by writing log records to multiple disks in parallel. Because interconnection network techniques are progressing by leaps and bounds, bandwidth and latency limitations are unlikely to slow down the system's overall performance. Experimental results demonstrate the feasibility of the proposal.
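As a rough illustration of the parallel logging idea, the sketch below routes log records to several log files, each flushed by its own writer thread. The class name, the routing of records by partition ID, and the record format are illustrative assumptions, not the paper's actual implementation; in the paper's setting the targets would be separate disks or NVRAM regions rather than ordinary files.

```python
# Hedged sketch: flushing log records to multiple log files in parallel.
import os
import queue
import threading

class ParallelLogger:
    def __init__(self, log_paths):
        self.queues = [queue.Queue() for _ in log_paths]
        for q, path in zip(self.queues, log_paths):
            t = threading.Thread(target=self._flush_loop, args=(q, path), daemon=True)
            t.start()

    def append(self, partition_id: int, record: bytes) -> None:
        # Route the record by data partition so each disk receives an independent stream.
        self.queues[partition_id % len(self.queues)].put(record)

    def _flush_loop(self, q, path):
        with open(path, "ab") as f:
            while True:
                rec = q.get()
                f.write(rec)
                f.flush()
                os.fsync(f.fileno())   # force the record onto stable storage
```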
… 17.67% over the old system. Keywords: font-banking system; XML hierarchy; XSLT; flatten out; accelerate web page generation.
Traditional multidimensional histograms, which are widely used for cardinality estimation of conjunctive range query predicates in RDBMS query optimizers, assume that correlations exist among attributes rather than relying on the simpler AVI (attribute value independence) assumption. However, they do not ...
Traditional web database cache techniques have a major disadvantage, namely poor data freshness, because they employ an asynchronous data refresh strategy. This paper proposes a novel web database cache, DB Facade. DB Facade uses a main memory database to cache the result sets of previous queries for subsequent reuse. Updates on the backend database system are captured in delta tables and then propagated to the web database cache in near real time, thereby guaranteeing data freshness. DB Facade offloads the query burden from backend database systems and exploits the power of a main memory database system to boost query performance. TPC-W testing shows that the system's performance increases by about 17 percent.
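The abstract does not describe the propagation mechanism in detail; the following is only a hedged sketch of one way delta-table propagation could work, assuming triggers append changes to a sequence-numbered delta table. The table and column names, the polling interval, and apply_to_cache() are hypothetical, not DB Facade's actual design.

```python
# Hedged sketch: near-real-time propagation of backend changes to an in-memory cache.
import time

def propagate_deltas(backend_conn, apply_to_cache, poll_interval=0.5):
    last_seq = 0
    while True:
        cur = backend_conn.cursor()
        # Changes are assumed to be appended (e.g. by triggers) to a delta table
        # keyed by a monotonically increasing sequence number.
        cur.execute(
            "SELECT seq, table_name, op, row_data FROM delta_log "
            "WHERE seq > %s ORDER BY seq", (last_seq,))
        for seq, table_name, op, row_data in cur.fetchall():
            apply_to_cache(table_name, op, row_data)   # replay the change in the cache
            last_seq = seq
        time.sleep(poll_interval)
```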
Detecting and exploiting correlations among columns in relational databases is of great value for query optimizers to generate better query execution plans (QEPs). We propose a more robust and informative metric than the chi-square test for detecting correlations among columns in large datasets: entropy correlation coefficients. We also introduce a novel yet simple kind of multi-dimensional synopsis, named COCA-Hist, to cope with different correlations in databases. With the precise metric of entropy correlation coefficients, correlations of various degrees can be detected effectively; when the coefficients indicate mutual independence among columns, the AVI (attribute value independence) assumption can be safely adopted. COCA can also serve as a data-mining tool with qualities comparable to CORDS. We demonstrate the effectiveness and accuracy of our approach through several experiments.
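For intuition, here is a minimal sketch of how an entropy correlation coefficient between two categorical columns could be computed. The paper's exact definition and normalization are not reproduced here; the sketch follows the common form 2·I(X;Y)/(H(X)+H(Y)), and the function names are assumptions.

```python
# Hedged sketch: entropy correlation coefficient between two columns of values.
import math
from collections import Counter

def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values() if c)

def entropy_correlation(col_x, col_y):
    n = len(col_x)
    hx = entropy(Counter(col_x), n)
    hy = entropy(Counter(col_y), n)
    hxy = entropy(Counter(zip(col_x, col_y)), n)
    mi = hx + hy - hxy                   # mutual information I(X;Y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mi / (hx + hy)          # 0 = independent, 1 = functional dependency
```

A value near 0 would support adopting the AVI assumption for the column pair, while a value near 1 points to a strong dependency worth capturing in a multi-dimensional synopsis.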
Operating on computer clusters, parallel databases enjoy enhanced performance. However, the scalability of a parallel database is limited by a number of factors. MapReduce-based systems, although highly scalable, do not deliver satisfactory performance for data-intensive applications. In this paper, we explore the feasibility of building a data warehouse that incorporates the best features of both technologies: the efficiency of parallel databases and the scalability and fault tolerance of MapReduce. Toward this goal, we design a prototype system called LinearDB. LinearDB organizes data in a decomposed snowflake schema and adopts three operations (transform, reduce, and merge) to accomplish query processing. All of these techniques are specially designed for the cluster environment. Our experimental results show that LinearDB's scalability matches that of MapReduce and that its performance is up to 3 times better than that of PostgreSQL.
This paper describes the details of using J-SIM to simulate parallel recovery in main memory databases. In update-intensive main memory database systems, I/O is still the dominant performance bottleneck. A parallel recovery scheme for large-scale, update-intensive main memory database systems is presented. Simulation provides a faster way of evaluating the new idea than an actual system implementation. J-SIM is an open-source discrete-time simulation software package. The simulation implementation using J-SIM is elaborated in terms of resource modeling, transaction processing system modeling, and workload modeling. Finally, analysis of the simulation results verifies the effectiveness of the parallel recovery scheme and demonstrates the feasibility of applying J-SIM to main memory database system simulation.
In algorithmic trading, computer algorithms automatically decide the time, quantity, and direction (buy, sell, or hold) of trading operations. To create a useful algorithm, its parameters should be optimized on historical data. However, parameter optimization is a time-consuming task because of the large search space. We propose to search the parameter combination space using the MapReduce framework, with the expectation that the optimization runtime can be cut down by leveraging MapReduce's parallel processing capability. This paper presents the details of our method and experimental results demonstrating its efficiency. We also show that an optimized rule-based strategy is more stable than one whose parameters are arbitrarily preset, while earning a comparable profit.
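The sketch below illustrates the idea in a simplified, hedged form: a moving-average crossover strategy stands in for the rule-based strategy, a process pool emulates the map phase over parameter combinations, and the final max emulates the reduce phase. The strategy, the parameter grid, and the profit calculation are all illustrative assumptions rather than the paper's method.

```python
# Hedged sketch: grid search over strategy parameters in map/reduce style.
from itertools import product
from multiprocessing import Pool

def backtest(args):
    (short, long), prices = args
    cash, position = 0.0, 0
    for i in range(long, len(prices)):
        ma_s = sum(prices[i - short:i]) / short
        ma_l = sum(prices[i - long:i]) / long
        if ma_s > ma_l and position == 0:        # buy signal
            position, cash = 1, cash - prices[i]
        elif ma_s < ma_l and position == 1:      # sell signal
            position, cash = 0, cash + prices[i]
    return (short, long), cash + (prices[-1] if position else 0.0)

def optimize(prices):
    grid = [(s, l) for s, l in product(range(5, 30, 5), range(30, 120, 10)) if s < l]
    with Pool() as pool:                          # "map": evaluate combinations in parallel
        results = pool.map(backtest, [(p, prices) for p in grid])
    return max(results, key=lambda r: r[1])       # "reduce": keep the most profitable one
```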
As systems become more complex and workloads more volatile, it is very hard for a DBA to quickly analyze performance data and optimize the system, so self-optimization is a promising technique. A data-mining-based optimization scheme for the lock table in database systems is presented. After being trained on performance data, a neural network becomes capable of predicting system performance from newly provided configuration parameters and performance data. While the system is running, performance data is continuously collected for a rule engine, which chooses the proper lock-table parameter to adjust; the rule engine relies on the trained neural network to precisely determine the amount of adjustment. The selected parameter is then adjusted accordingly. The scheme was implemented and tested with a TPC-C workload, and system throughput increased by about 16 percent.
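The abstract does not specify the interface between the rule engine and the neural network, so the loop below is only a hedged sketch of the general feedback idea. A generic regressor stands in for the trained network, and collect_metrics(), apply_setting(), and the candidate list are hypothetical.

```python
# Hedged sketch of the self-tuning loop. `model` stands in for the trained neural network;
# collect_metrics(), apply_setting(), and `candidates` are illustrative placeholders.
def tune_lock_table(model, collect_metrics, apply_setting, candidates):
    metrics = collect_metrics()      # assumed to be a list of numbers, e.g. lock waits, throughput
    # Score each candidate lock-table setting by the throughput the model predicts for it.
    best = max(candidates, key=lambda size: model.predict([metrics + [size]])[0])
    apply_setting(best)              # adjust the selected parameter online
    return best
```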
A futures trading evaluation system analyzes the trading history of individual investors to find the root causes of profit and loss, so that investors can learn from their past and make better decisions in the future. To analyze an investor's trading history, the system processes a large volume of transaction data to calculate key performance indicators as well as time-series behavior patterns, and finally derives recommendations with the help of an expert knowledge base. The paper first presents the working logic of the evaluation system and then focuses on the parallel data processing techniques the system is built on. The parallel processing architecture, the data distribution scheme, the algorithms for calculating key performance indicators, and the distributed time-series analysis algorithms are elaborated in detail. The system is highly scalable, and by exploiting the power of parallel processing, the generation time of an evaluation report is cut from 1-3 minutes to 30-45 seconds.
In update-intensive applications, main memory database systems produce a large volume of log records, and writing them out efficiently is critical to speeding up transaction processing. We propose a parallel recovery scheme based on XOR differential logging for main memory database systems in such environments. Some NVRAM is used to temporarily hold log records and decouple transaction commit from disk writes, and the inherent parallelism of differential logging is exploited to accelerate log flushing across multiple log disks. During recovery, log records are loaded from multiple log disks and applied to data partitions immediately, without reordering according to serialization order, which cuts down total recovery time. The scheme employs a consistent checkpointing method based on data partitions. Log records are classified according to the IDs of the data partitions they access. Data partitions are recovered according to loading priorities computed from update frequencies and transaction waiting times, so the data access demands of new transactions arriving after failure recovery begins are attended to immediately. The scheme thus provides system availability during recovery, which is important for large-scale main memory database systems.
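For intuition, a minimal sketch of the XOR differential logging primitive follows. The fixed-size page layout and function names are assumptions, but the order-independence property shown in the comments is exactly what allows log disks to be replayed in parallel without reordering.

```python
# Hedged sketch of XOR differential logging on fixed-size byte pages.
def xor_diff(before: bytes, after: bytes) -> bytes:
    # The differential log record is the bitwise XOR of the before- and after-images.
    return bytes(a ^ b for a, b in zip(before, after))

def apply_diff(page: bytearray, diff: bytes) -> None:
    # Redo and undo are the same operation. Because XOR is commutative and associative,
    # diffs streamed from different log disks can be applied in any order and still
    # reproduce the final page state, so no sorting by serialization order is required.
    for i, b in enumerate(diff):
        page[i] ^= b
```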
Big data analysis is a major recent challenge. Cloud computing is attracting more and more big data analysis applications because of its good scalability and fault tolerance. Some aggregation functions, such as SUM, can be computed in parallel because they are distributive: partial results over data partitions can simply be added together. Unfortunately, some statistical functions are not naturally parallelizable because they are not distributive. In this paper, we focus on the percentile computation problem. We propose an iterative, prediction-based parallel algorithm for a distributed system, where prediction is done through a sampling technique. Experimental results verify the efficiency of our algorithm.
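The abstract does not spell out the algorithm, so the following is only a hedged sketch of an iterative, sampling-guided (quickselect-style) percentile search over partitioned data. The pivot-sampling policy, the nearest-rank percentile definition, and the in-memory lists standing in for remote shards are assumptions.

```python
# Hedged sketch: iterative, sampling-guided percentile search over partitioned data.
# Lists stand in for remote shards; the per-shard counts and filters are the
# steps that would run in parallel on a cluster.
import random

def distributed_percentile(partitions, q):
    n = sum(len(p) for p in partitions)
    k = int(q * (n - 1))                              # zero-based rank of the q-th percentile
    shards = [list(p) for p in partitions]
    while True:
        # "Prediction" step: pick a pivot from a small sample instead of moving all data.
        sample = [x for s in shards if s for x in random.sample(s, min(10, len(s)))]
        pivot = sorted(sample)[len(sample) // 2]
        below = sum(sum(1 for x in s if x < pivot) for s in shards)   # parallel count
        equal = sum(sum(1 for x in s if x == pivot) for s in shards)  # parallel count
        if below <= k < below + equal:
            return pivot
        if k < below:                                  # keep only the smaller side
            shards = [[x for x in s if x < pivot] for s in shards]
        else:                                          # keep only the larger side
            shards = [[x for x in s if x > pivot] for s in shards]
            k -= below + equal
```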