2007 IEEE Symposium on Computational Intelligence and Data Mining, 2007
Data mining is concerned with important aspects related to both database techniques and AI/machine learning mechanisms, and provides an excellent opportunity for exploring the relationship between retrieval and inference/reasoning, a fundamental issue concerning the nature of data mining. In the data mining context, this relationship can be restated as the connection and differences between data retrieval and data mining. In ...
Tests were carried out with a real charged human body discharging to ground. The results show that the peak-to-peak electric field radiated at a distance of several centimeters is in the range of 10^2-10^3 V/m, and the magnetic field strength can be in the range of 10-10^2 A/m at a distance of 10 cm from the discharge. The spectrum of the field is extremely wide. The experiments also show that the amplitude of the electric field radiated by ESD when a human holding a metal tool discharges to ground is many times larger than that of a human finger discharging directly to ground. Electrostatic discharge is one of the most common harmful electromagnetic sources for electronic equipment in many environments. Tests also show that the captured waveform may exhibit ringing, which is stimulated by the fast-rising ESD due to capacitance and inductance, including any parasitic LC parameters of the probe and cable.
The original k-means clustering algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being directly applied to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical value, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference. World Scientific, Singapore, 1997, pp. 21-34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, in contrast to the k-means fashion of minimizing a numerically valued cost. However, as is the case with most data clustering algorithms, the algorithm requires a pre-setting or random selection of the initial points (modes) of the clusters. Differences in the initial points often lead to considerably distinct cluster results. In this paper we present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm to k-modes clustering to improve the accuracy and repeatability of the clustering results [cf. P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1998]. Experiments show that the k-modes clustering algorithm using refined initial points leads to higher-precision results much more reliably than the random selection method without refinement, thus making the refinement process applicable to many data mining applications with categorical data.
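The frequency-based mode update mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simple matching dissimilarity (the count of mismatched attributes) as the distance measure, which the abstract does not state; the function names are hypothetical and the initial-point refinement step is not shown.

```python
from collections import Counter

def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ (assumed measure)."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(records):
    """Frequency-based update: take the most frequent category of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def assign(records, modes):
    """Assign every record to the index of its nearest mode."""
    return [
        min(range(len(modes)), key=lambda j: matching_distance(r, modes[j]))
        for r in records
    ]

records = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = update_mode(records)
```

Because the result depends on which modes the iteration starts from, a real run would repeat `assign` and `update_mode` from the chosen initial points until the assignments stop changing.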
IEEE International Conference on Computer Systems and Applications, 2006
The R* tree is a useful data structure for handling spatial data. However, although objects stored in the same R* tree leaf node enjoy spatial proximity, it is well known that R* trees cannot be used directly for cluster analysis. Nevertheless, the R* tree's indexing feature can be used to assist existing cluster analysis methods, thus enhancing their performance or cluster quality. In this paper, we explore how to use R* trees to improve the well-known K-Means and hierarchical clustering methods. Based on the R* tree's feature of indexing Minimum Bounding Boxes (MBBs) according to spatial proximity, we extend the R* tree's application to cluster analysis of time series. Two improved algorithms, KMeans-R and Hierarchy-R, are proposed. The performance of these two methods is evaluated against K-Means and K-Means with a sampling technique (KMeans-S) using similarity-oriented, supervised measures of cluster validity. Rand Index (RI), Adjusted Rand Index (ARI) and Information Gain (IG) are used as evaluation measures in our experiments. Compared with K-Means and KMeans-S, the clustering results on different data sets show that KMeans-R and Hierarchy-R achieve better clustering quality.
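As an illustration of one of the evaluation measures named above, the following sketch computes the Rand Index using its standard pair-counting definition. It is not taken from the paper; the function name is hypothetical, and the adjusted variant (ARI) and Information Gain are not shown.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# A clustering that matches the reference up to a renaming of clusters scores 1.0.
score = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The measure compares pairs rather than label values, so the cluster identifiers produced by a clustering method do not need to match the reference class names.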
Spatial data mining is the process of extracting implicit information, such as weather patterns around latitudes, spatial features in a region, etc., with the goal of knowledge discovery. The work reported here is based on our earlier work on branch-grafted R-trees. We have taken a bottom-up approach in our research: from an efficient spatial data structure (i.e., the branch-grafted R-tree implementation), to efficient data access methods, and finally, to effective spatial data mining. Since previous experiments have shown that there are significant advantages to using the branch-grafted implementation, this bottom-up approach exploits the performance advantages of branch-grafted R-trees.
Knowledge discovery in databases (KDD) and data mining have good potential in many applications. However, in order to make KDD useful, many problems remain to be solved. One such problem is the query formulation problem: “What to do if one does not know how to specify the desired query to begin with?” In this paper we explore an approach to ...
Concurrency control in deductive databases is an important issue which deserves much attention. In this paper we examine the implementation of locking schemes. We adopt a model based on dependency graphs extended with compatibility trees, and describe features related to the implementation of locking schemes in this model. Algorithms for read and write locking schemes are provided, and are illustrated by several ...
We have developed the Recombinantly-produced Antimicrobial Peptides Database (RAPD) to house relevant information on recombinant approaches to generate antimicrobial peptides. Key information stored in the database, which is extracted from published experiments, includes expression host, fusion strategy, release method and yield for individual peptides. Bibliographic data directly related to each particular case are also available. RAPD allows easy comparison of the relative popularity and efficiency of different strategies, and can thus be used as a guideline for future production of similar peptides. The database is freely available at http://faculty.ist.unomaha.edu/chen/rapd/index.php.
Mathematical programming based methods have been applied to credit risk analysis and have proven to be powerful tools. One challenging issue in mathematical programming is the computational complexity of finding optimal solutions. To overcome this difficulty, this paper proposes a Multi-criteria Convex Quadratic Programming model (MCCQP). Instead of looking for the global optimal solution, the proposed model only needs to solve a set of linear equations. We test the model using three credit risk analysis datasets and compare MCCQP results with four well-known classification methods: LDA, Decision Tree, SVMLight, and LibSVM. The experimental results indicate that the proposed MCCQP model achieves classification accuracies as good as or better than those of the other methods.
Medical data mining has been a popular data mining topic of late. Compared with other data mining applications, medical data mining has some unique characteristics. Since medical records are related to human subjects, privacy protection is taken more seriously than in other data mining tasks. This paper applies two data separation techniques, vertical and horizontal partitioning, to preserve privacy in medical data classification. In the vertical partition approach, each site uses a portion of the attributes to compute its results, and the distributed results are assembled at a central trusted party using a majority-vote ensemble method. In the horizontal partition approach, data are distributed among several sites. Each site computes results on its own data, and a central trusted party integrates these results using an ensemble method. We implement these two approaches using medical datasets from the UCI Machine Learning archive and report the experimental results.
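The majority-vote step performed by the central trusted party can be sketched as follows. This is a minimal illustration that assumes each site submits one predicted label per record, in the same record order; the function and variable names are hypothetical, and the per-site classifiers themselves are not shown.

```python
from collections import Counter

def majority_vote(site_predictions):
    """Combine per-site label lists into one label per record by majority vote."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*site_predictions)
    ]

# Three sites, each predicting labels for the same three records.
site_predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
combined = majority_vote(site_predictions)
```

Only predicted labels travel to the central party in this sketch, so no site has to disclose its share of the attributes or records.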
Data mining and knowledge discovery have made great progress during the last fifteen years. As one of the major tasks of data mining, classification has wide business and scientific applications. Among a variety of proposed methods, mathematical programming based approaches have proven to be excellent in terms of classification accuracy, robustness, and efficiency. However, there are several difficult issues, two of which are of particular interest to this research. The first issue is that it is challenging to find optimal solutions for large-scale datasets in mathematical programming problems due to the computational complexity. The second issue is that many mathematical programming problems require specialized codes or programs such as CPLEX or LINGO. The objective of this study is to propose solutions for these two problems. This paper proposes a mathematical programming model and applies it to classification problems to address two aspects of data mining algorithms: speed and scalability.
... KDD-99 dataset. The result of cross-validated MCQP indicates that MCQP prediction is stable. ...
In credit card portfolio management, a major challenge is to classify and predict credit cardholders' behaviors with reliable precision, because cardholders' behaviors are rather dynamic in nature. Multiclass classification refers to classifying data objects into more than two classes. Many real-life applications require multiclass classification. The purpose of this paper is to compare three multiclass classification approaches: decision tree, Multiple Criteria Mathematical Programming (MCMP), and the Hierarchical Method for Support Vector Machines (SVM). While MCMP considers all classes at once, SVM was initially designed for binary classification. Extending SVM from two-class classification to multiclass classification is still an ongoing research issue, and many proposed approaches use a hierarchical method. In this paper, we focus on one common hierarchical method: one-against-all classification. We compare the performance of See5, MCMP and the SVM one-against-all approach using a real-life credit card dataset. Results show that MCMP achieves better overall accuracies than See5 and one-against-all SVM.
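The one-against-all idea can be sketched in a few lines: one binary scorer is built per class, and a record receives the class whose scorer is most confident. The scorers below are toy linear functions standing in for trained binary SVMs, and the class labels are hypothetical; none of this comes from the paper's experiments.

```python
def one_against_all_predict(x, scorers):
    """Return the class whose 'this class vs. the rest' scorer gives the highest score."""
    return max(scorers, key=lambda label: scorers[label](x))

# Toy stand-ins for trained binary decision functions (assumed, for illustration only).
scorers = {
    "good": lambda x: x[0] - x[1],
    "late": lambda x: x[1] - x[0],
    "bankrupt": lambda x: -x[0] - x[1],
}

predicted = one_against_all_predict((3, 1), scorers)
```

With k classes this scheme needs k binary classifiers, whereas a model that considers all classes at once, as the abstract describes for MCMP, needs a single optimization.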
In credit card portfolio management, predicting cardholders' behavior is key to reducing the charge-off risk of credit card issuers. The most commonly used methods for predicting credit card defaulters are credit scoring models. Most of these credit scoring models use supervised classification methods. Although these methods have made considerable progress in bankruptcy prediction, they are unsuitable for data records without predefined class labels. Therefore, it is worthwhile to investigate the applicability of unsupervised learning methods to credit card account classification. The objectives of this paper are: (1) to explore an unsupervised learning method, cluster analysis, for credit card account classification, and (2) to improve clustering classification results using ensemble and supervised learning methods. In particular, a general-purpose clustering toolkit, CLUTO, from the University of Minnesota, was used to classify a real-life credit card dataset, and two supervised classification methods, decision tree and multiple-criteria linear programming (MCLP), were used to improve the clustering results. The classification results indicate that clustering can be used either as a stand-alone classification method or as a preprocessing step for supervised classification methods.
Finding a common pattern among nucleic acid sequences in a given database is an important yet relatively difficult problem in computational biology. Such a pattern is useful for describing the characteristics of a certain family of nucleic acid sequences, and can also be used for classification purposes as well as for examining the closeness of two organisms. In this paper, we present a global pattern extraction tool named GAPE, which is applicable in computational biology to describe a certain family of nucleic acid sequences with common features. The algorithm utilizes an optimized Genetic Algorithm (GA) framework to drive the evolution of desirable patterns. A specialized pair-wise alignment algorithm is also introduced to efficiently examine the closeness of a sequence to a regular expression pattern. Experimental results using real biological data are presented to indicate the effectiveness of the tool.
Resistance to temozolomide poses a major clinical challenge in glioblastoma multiforme treatment, and the mechanisms underlying the development of temozolomide resistance remain poorly understood. Enhanced DNA repair and mutagenesis can allow tumour cells to survive, contributing to resistance and tumour recurrence. Here, using recurrent temozolomide-refractory glioblastoma specimens, temozolomide-resistant cells, and resistant-xenograft models, we report that loss of miR-29c via c-Myc drives the acquisition of temozolomide resistance through enhancement of REV3L-mediated DNA repair and mutagenesis in glioblastoma. Importantly, disruption of c-Myc/miR-29c/REV3L signalling may have dual anticancer effects, sensitizing the resistant tumours to therapy as well as preventing the emergence of acquired temozolomide resistance. Our findings suggest a rationale for targeting the c-Myc/miR-29c/REV3L signalling pathway as a promising therapeutic approach for glioblastoma, even in recurrent, t...
Papers by Zhengxin Chen