
Pengyi Yang

The University of Sydney, School of Information Technologies, Graduate Student


Papers by Pengyi Yang

Improving X! Tandem on peptide identification from mass spectrometry by self-boosted Percolator

A critical component in mass spectrometry (MS)-based proteomics is an accurate protein identification procedure. Database search algorithms commonly generate a list of peptide-spectrum matches (PSMs). The validity of these PSMs is critical for downstream analysis since proteins that are present in the sample are inferred from those PSMs. A variety of post-processing algorithms have been proposed to validate and filter PSMs. Among them, the most popular ones include a semi-supervised learning (SSL) approach known as Percolator and an empirical modeling approach known as PeptideProphet. However, they are predominantly designed for commercial database search algorithms, i.e. SEQUEST and MASCOT. Therefore, it is highly desirable to extend and optimize those PSM post-processing algorithms for open source database search algorithms such as X!Tandem. In this study, we propose a Self-boosted Percolator for post-processing X!Tandem search results. We find that the SSL algorithm utilized by Percolator depends heavily on the initial ranking of PSMs. Starting with a poor PSM ranking list may cause Percolator to perform suboptimally. By implementing Percolator in a cascade learning manner, we can progressively improve the performance through multiple boost runs, enabling many more PSM identifications without sacrificing false discovery rate (FDR).
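The FDR constraint mentioned in this abstract is commonly estimated with a target-decoy strategy. The following is a minimal sketch under that assumption, not the paper's implementation: each PSM is assumed to carry a score (higher is better) and a decoy flag, and the function name `accepted_targets_at_fdr` and the simple decoys/targets estimator are illustrative choices.

```python
def accepted_targets_at_fdr(psms, max_fdr=0.01):
    """Return the scores of target PSMs accepted at an estimated FDR.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    The FDR at a cutoff is estimated as (#decoys) / (#targets) above the cutoff.
    This estimator is an assumption made for illustration.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_count = 0  # largest number of targets whose estimated FDR <= max_fdr
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_count = targets
    # Keep the top-ranked targets up to the deepest cutoff that met the bound.
    return [s for s, d in ranked if not d][:best_count]


if __name__ == "__main__":
    toy = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    print(accepted_targets_at_fdr(toy, max_fdr=0.25))
```

A better initial ranking pushes decoys further down the list, so more targets are accepted at the same bound, which is the effect the abstract attributes to repeated boost runs.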

ENSEMBLE METHODS AND HYBRID ALGORITHMS FOR COMPUTATIONAL AND SYSTEMS BIOLOGY

Modern molecular biology increasingly relies on the application of high-throughput technologies for studying the function, interaction, and integration of genes, proteins, and a variety of other molecules on a large scale. The application of those high-throughput technologies has led to the exponential growth of biological data, making modern molecular biology a data-intensive science. Huge effort has been directed to the development of robust and efficient computational algorithms in order to make sense of these extremely large and complex biological data, giving rise to several interdisciplinary fields, such as computational and systems biology.

Re-Fraction: A Machine Learning Approach for Deterministic Identification of Protein Homolog

A key step in the analysis of mass spectrometry (MS)-based proteomics data is the inference of proteins from identified peptide sequences. Here we describe Re-Fraction, a novel machine learning algorithm that enhances deterministic protein identification. Re-Fraction utilizes several protein physical properties to assign proteins to expected protein fractions that comprise large-scale MS-based proteomics data. This information is then used to appropriately assign peptides to specific proteins. This approach is sensitive, highly specific, and computationally efficient. We provide algorithms and source code for the current version of Re-Fraction, which accepts output tables from the MaxQuant environment. Nevertheless, the principles behind Re-Fraction can be applied to other protein identification pipelines where data are generated from samples fractionated at the protein level. We demonstrate the utility of this approach through reanalysis of data from a previously published study and generate lists of proteins deterministically identified by Re-Fraction that were previously only identified as members of a protein group. We find that this approach is particularly useful in resolving protein groups composed of splice variants and homologues, which are frequently expressed in a cell- or tissue-specific manner and may have important biological consequences.

Assignment 2 (Part 1, 2): Literature Review & Research Approach Outline

Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection of microarrays because such datasets often contain a limited number of training samples but a large number of features, under the assumption that only a few of them are strongly associated with the classification task while the others are redundant and noisy. The challenge is how to select the most informative gene subset from the original set for classifier creation and accurate sample classification.

Sample Subset Optimization for Classifying Imbalanced Biological Data

Advances in Knowledge Discovery …, Jan 1, 2011

Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may be greatly outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in a suboptimal classification. It is desirable to compensate for the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderate and extremely high imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling, SMOTE sampling, and those in widely used ensemble approaches such as bagging and boosting.
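To make the idea of "multiple roughly balanced" training sets concrete, here is a minimal sketch of the plain random under-sampling baseline that the abstract compares against. It is not the paper's sample subset optimization, which selects majority samples by an optimization procedure rather than at random. The function name `balanced_subsets` and its parameters are hypothetical.

```python
import random


def balanced_subsets(majority, minority, n_subsets=5, seed=0):
    """Build n_subsets roughly balanced training sets.

    Each subset keeps every minority sample and adds an equally sized
    random draw (without replacement) from the majority class.
    Random selection is used here only as a stand-in baseline.
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(list(majority), k)
        subsets.append(sampled + list(minority))
    return subsets


if __name__ == "__main__":
    negatives = list(range(100))      # majority class sample IDs
    positives = list(range(100, 110)) # minority class sample IDs
    for subset in balanced_subsets(negatives, positives, n_subsets=3):
        print(len(subset))
```

Each subset would then train one base classifier, and the ensemble prediction could be formed by voting or by averaging decision scores.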

A review of ensemble methods in bioinformatics

Current …, Jan 1, 2010

Ensemble learning is an intensively studied technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample size, high dimensionality, and complex data structures. The aim of this article is two-fold. First, it is to provide a review of the most widely used ensemble learning methods and their application in various bioinformatics problems, including the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences. Second, we try to identify and summarize future trends of ensemble methods in bioinformatics. Promising directions such as ensembles of support vector machines, meta-ensembles, and ensemble-based feature selection are discussed.

Multiagent Based Bio-data Mining

rp-www.cs.usyd.edu.au

This paper argues for applying multiagent based data mining technologies to biological data analysis. The rationale is described from multiple perspectives with an emphasis on biological context. Following that, an initial multiagent based bio-data mining framework is conceived, and a prototype system is developed to demonstrate how it helps biologists, who are often unfamiliar with data mining technologies, to perform a comprehensive mining task for answering biological questions. The system offers a new way to reuse biological datasets and available data mining algorithms to their fullest.

An embedded two-layer feature selection approach for microarray data analysis

IEEE intelligent informatics bulletin, Jan 1, 2011

Feature selection is an important technique in dealing with application problems with a large number of variables and limited training samples, such as image processing, combinatorial chemistry, and microarray analysis. Commonly employed feature selection strategies can be divided into filter and wrapper methods. In this study, we propose an embedded two-layer feature selection approach that combines the advantages of filter and wrapper algorithms while avoiding their drawbacks. The hybrid algorithm, called GAEF (Genetic Algorithm with embedded filter), divides the feature selection process into two stages. In the first stage, a Genetic Algorithm (GA) is employed to pre-select features, while in the second stage a filter selector is used to further identify a small feature subset for accurate sample classification. Three benchmark microarray datasets are used to evaluate the proposed algorithm. The experimental results suggest that this embedded two-layer feature selection strategy is able to improve the stability of the selection results as well as the sample classification accuracy.

Agent-Based Hybrid Approach for Bioinformatics

computer.org

As Edward Keedwell and Ajit Narayana 2 point out, hybrid approaches are useful for solving variou...

A dynamic wavelet-based algorithm for pre-processing tandem mass spectrometry data

Bioinformatics, Jan 1, 2010

Mass spectrometry (MS)-based proteomics is one of the most commonly used research techniques for identifying and characterizing proteins in biological and medical research. The identification of a protein is the critical first step in elucidating its biological function. Successful protein identification depends on various interrelated factors, including effective analysis of MS data generated in a proteomic experiment. This analysis comprises several stages, often combined in a pipeline or workflow. The first component of the analysis is known as spectra pre-processing. In this component, the raw data generated by the mass spectrometer is processed to eliminate noise and identify the mass-to-charge ratio (m/z) and intensity for the peaks in the spectrum corresponding to the presence of certain peptides or peptide fragments. Since all downstream analyses depend on the pre-processed data, effective pre-processing is critical to protein identification and characterization. There is a critical need for more robust pre-processing algorithms that perform well on tandem mass spectra under a variety of different conditions and can be easily integrated into sophisticated data analysis pipelines for practical wet-lab applications. We have developed a new pre-processing algorithm. Based on wavelet theory, our method uses a dynamic peak model to identify peaks. It is designed to be easily integrated into a complete proteomic analysis workflow. We compared the method with other available algorithms using a reference library of raw MS and tandem MS spectra with known protein composition information. Our pre-processing algorithm results in the identification of significantly more peptides and proteins in the downstream analysis for a given false discovery rate. Software available at: http://www.maths.usyd.edu.au/u/penghao/index.html.

Genetic algorithm-based multi-objective optimisation for QoS-aware web services composition

Knowledge Science, Engineering …, Jan 1, 2011

Finding an optimal solution for QoS-aware Web service composition with various restrictions on qualities is a multi-objective optimisation problem. A popular multi-objective genetic algorithm, NSGA-II, is studied in order to provide a set of optimal solutions for QoS-based service composition. Experiments with different numbers of abstract and concrete services confirm the expected behaviour of the algorithm.
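The "set of optimal solutions" in a multi-objective problem is the set of non-dominated (Pareto-optimal) candidates, which is the notion NSGA-II uses to rank its population. The sketch below shows only that dominance filter, not NSGA-II itself; the two objectives (cost, response time), both minimized, and the function names are assumptions chosen for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical compositions scored as (cost, response_time).
    compositions = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
    print(pareto_front(compositions))
```

This brute-force filter compares every pair of candidates, which is adequate for small illustrative sets; NSGA-II adds non-dominated sorting into successive fronts and a crowding-distance measure on top of the same dominance test.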

A particle swarm based hybrid system for imbalanced medical data sampling

BMC genomics, Jan 1, 2009

Background: Medical and biological data commonly have small sample sizes, missing values, and most importantly, imbalanced class distribution. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset.

Hierarchical kernel mixture models for the prediction of AIDS disease progression using HIV structural gp120 profiles

BMC …, Jan 1, 2010

Changes to the glycosylation profile on HIV gp120 can influence viral pathogenesis and alter AIDS disease progression. The characterization of glycosylation differences at the sequence level is inadequate as the placement of carbohydrates is structurally complex. However, no structural framework is available to date for the study of HIV disease progression. In this study, we propose a novel machine-learning based framework for the prediction of AIDS disease progression in three stages (RP, SP, and LTNP) using the HIV structural gp120 profile. This new intelligent framework proves to be accurate and provides an important benchmark for predicting AIDS disease progression computationally. The model is trained using a novel HIV gp120 glycosylation structural profile to detect possible stages of AIDS disease progression for the target sequences of HIV+ individuals. The performance of the proposed model was compared to seven different existing machine-learning models on the newly proposed gp120-Benchmark_1 dataset in terms of error rate (MSE), accuracy (CCI), stability (STD), and complexity (TBM). The novel framework showed better predictive performance with 67.82% CCI, 30.21 MSE, 0.8 STD, and 2.62 TBM on the three stages of AIDS disease progression of 50 HIV+ individuals. This framework is an invaluable bioinformatics tool that will be useful to the clinical assessment of viral pathogenesis.

Multiagent framework for bio-data mining

Rough Sets and Knowledge Technology, Jan 1, 2009

This paper proposes to apply multiagent based data mining technologies to biological data analysis. The rationale is justified from multiple perspectives with an emphasis on biological context. Following that, an initial multiagent based bio-data mining framework is presented. Based on the framework, we developed a prototype system to demonstrate how it helps biologists to perform a comprehensive mining task for answering biological questions. The system offers a new way to reuse biological datasets and available data mining algorithms with ease.

A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data

Neurocomputing, Jan 1, 2010

A clustering based hybrid system for mass spectrometry data analysis

Pattern Recognition in Bioinformatics, Jan 1, 2008

Recently, much attention has been given to mass spectrometry (MS) technology based disease classification, diagnosis, and protein-based biomarker identification. Similar to microarray based investigation, proteomic data generated by this kind of high-throughput experiment often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded with data noise, redundancy and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of this kind of data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high dimensional data. The proposed method uses a k-means clustering based feature extraction and selection procedure to bridge the filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers among those generated by the k-means clustering algorithm. Experimental results obtained by using the proposed method indicate that it is suitable for m/z biomarker selection and MS based sample classification.

A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data

BMC bioinformatics, Jan 1, 2010

Background: Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists to focus on the selected genes to further validate their biological hypotheses.

An ensemble of classifiers with genetic algorithm based feature selection

IEEE Intelligent Informatics Bulletin, Jan 1, 2008

Different data classification algorithms have been developed and applied in various areas to analyze and extract valuable information and patterns from large datasets with noise and missing values. However, none of them could consistently perform well over all datasets. To this end, ensemble methods have been suggested as a promising measure. This paper proposes a novel hybrid algorithm, which is the combination of a multi-objective Genetic Algorithm (GA) and an ensemble classifier. While the ensemble classifier, which consists of a decision tree classifier, an Artificial Neural Network (ANN) classifier, and a Support Vector Machine (SVM) classifier, is used as the classification committee, the multi-objective Genetic Algorithm is employed as the feature selector to help the ensemble classifier improve the overall sample classification accuracy while also identifying the most important features in the dataset of interest. The proposed GA-Ensemble method is tested on three benchmark datasets, and compared with each individual classifier as well as the methods based on mutual information theory, bagging and boosting. The results suggest that this GA-Ensemble method outperforms the other algorithms in the comparison and is a useful method for classification and feature selection problems.

An agent-based hybrid system for microarray data analysis

Intelligent Systems, IEEE, Jan 1, 2009

Hybrid methods to select informative gene sets in microarray data classification

Proceedings of the 20th Australian joint …, Jan 1, 2007

One of the key applications of microarray studies is to select and classify gene expression profiles of cancer and normal subjects. In this study, two hybrid approaches, genetic algorithm with decision tree (GADT) and genetic algorithm with neural network (GANN), are utilized to select optimal gene sets which contribute to the highest classification accuracy. Two benchmark microarray datasets were tested, and the most significant disease related genes have been identified. Furthermore, the selected gene sets achieved comparably high sample classification accuracy (96.79% and 94.92% in colon cancer dataset, 98.67% and 98.05% in leukemia dataset) compared with those obtained by mRMR algorithm. The study results indicate that these two hybrid methods are able to select disease related genes and improve classification accuracy.

Improving X! Tandem on peptide identification from mass spectrometry by self-boosted Percolator

A critical component in mass spectrometry (MS)-based proteomics is an accurate protein identifica... more A critical component in mass spectrometry (MS)-based proteomics is an accurate protein identification procedure. Database search algorithms commonly generate a list of peptide-spectrum matches (PSMs). The validity of these PSMs is critical for downstream analysis since proteins that are present in the sample are inferred from those PSMs. A variety of post-processing algorithms have been proposed to validate and filter PSMs. Among them, the most popular ones include a semi-supervised learning (SSL) approach known as Percolator and an empirical modeling approach known as PeptideProphet. However, they are predominantly designed for commercial database search algorithms i.e. SEQUEST and MASCOT. Therefore, it is highly desirable to extend and optimize those PSM post-processing algorithms for open source database search algorithms such as X!Tandem. In this study, we propose a Self-boosted Percolator for post-processing X!Tandem search results. We find that the SSL algorithm utilized by Percolator depends heavily on the initial ranking of PSMs. Starting with a poor PSM ranking list may cause Percolator to perform suboptimally. By implementing Percolator in a cascade learning manner, we can progressively improve the performance through multiple boost runs, enabling many more PSM identifications without sacrificing false discovery rate (FDR).

ENSEMBLE METHODS AND HYBRID ALGORITHMS FOR COMPUTATIONAL AND SYSTEMS BIOLOGY

Modern molecular biology increasingly relies on the application of high-throughput technologies f... more Modern molecular biology increasingly relies on the application of high-throughput technologies for studying the function, interaction, and integration of genes, proteins, and a variety of other molecules on a large scale. The application of those highthroughput technologies has led to the exponential growth of biological data, making modern molecular biology a data-intensive science. Huge effort has been directed to the development of robust and efficient computational algorithms in order to make sense of these extremely large and complex biological data, giving rise to several interdisciplinary fields, such as computational and systems biology.

Re-Fraction: A Machine Learning Approach for Deterministic Identification of Protein Homolog

A key step in the analysis of mass spectrometry (MS)-based proteomics data is the inference of pr... more A key step in the analysis of mass spectrometry (MS)-based proteomics data is the inference of proteins from identified peptide sequences. Here we describe Re-Fraction, a novel machine learning algorithm that enhances deterministic protein identification. Re-Fraction utilizes several protein physical properties to assign proteins to expected protein fractions that comprise large-scale MS-based proteomics data. This information is then used to appropriately assign peptides to specific proteins. This approach is sensitive, highly specific, and computationally efficient. We provide algorithms and source code for the current version of Re-Fraction, which accepts output tables from the MaxQuant environment. Nevertheless, the principles behind Re-Fraction can be applied to other protein identification pipelines where data are generated from samples fractionated at the protein level. We demonstrate the utility of this approach through reanalysis of data from a previously published study and generate lists of proteins deterministically identified by Re-Fraction that were previously only identified as members of a protein group. We find that this approach is particularly useful in resolving protein groups composed of splice variants and homologues, which are frequently expressed in a cell-or tissue-specific manner and may have important biological consequences.

Assignment 2 (Part 1, 2): Literature Review & Research Approach Outline

Feature selection techniques are critical to the analysis of high dimensional datasets . This is ... more Feature selection techniques are critical to the analysis of high dimensional datasets . This is especially true in gene selection of microarrays because such datasets often contain a limited number of training samples but large amount of features, under the assumption that only several of which are strongly associated with the classification task while others are redundant and noisy . The challenge is how to select the most informative gene subset from the original set for classifier creation and accurate sample classification.

Sample Subset Optimization for Classifying Imbalanced Biological Data

Advances in Knowledge Discovery …, Jan 1, 2011

Data in many biological problems are often compounded by imbalanced class distribution. That is, ... more Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in a suboptimal classification. It is desirable to compensate the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderate and extremely high imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling; SMOTE sampling, and those in widely used ensemble approaches such as bagging and boosting.

A review of ensemble methods in bioinformatics

Current …, Jan 1, 2010

Ensemble learning is an intensively studies technique in machine learning and pattern recognition... more Ensemble learning is an intensively studies technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample size, high-dimensionality, and complexity data structures. The aim of this article is two-fold. First, it is to provide a review of the most widely used ensemble learning methods and their application in various bioinformatics problems, including the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences. Second, we try to identify and summarize future trends of ensemble methods in bioinformatics. Promising directions such as ensemble of support vector machine, meta-ensemble, and ensemble based feature selection are discussed.

Multiagent Based Bio-data Mining

rp-www.cs.usyd.edu.au

This paper argues for applying multiagent based data mining technologies to biological data analy... more This paper argues for applying multiagent based data mining technologies to biological data analysis. The rationale is described from multiple perspectives with an emphasize on biological context. Followed by that, an initial multiagent based bio-data mining framework is conceived, and a prototype system is developed to demonstrate how it helps the biologists who are often unfamiliar with data mining technologies to perform a comprehensive mining task for answering biological questions. The system offers a new way to reuse biological datasets and available data mining algorithms at their fullest.

An embedded two-layer feature selection approach for microarray data analysis

IEEE intelligent informatics bulletin, Jan 1, 2011

Feature selection is an important technique in dealing with application problems with large numbe... more Feature selection is an important technique in dealing with application problems with large number of variables and limited training samples, such as image processing, combinatorial chemistry, and microarray analysis. Commonly employed feature selection strategies can be divided into filter and wrapper. In this study, we propose an embedded two-layer feature selection approach to combining the advantages of filter and wrapper algorithms while avoiding their drawbacks. The hybrid algorithm, called GAEF (Genetic Algorithm with embedded filter), divides the feature selection process into two stages. In the first stage, Genetic Algorithm (GA) is employed to pre-select features while in the second stage a filter selector is used to further identify a small feature subset for accurate sample classification. Three benchmark microarray datasets are used to evaluate the proposed algorithm. The experimental results suggest that this embedded two-layer feature selection strategy is able to improve the stability of the selection results as well as the sample classification accuracy.

Agent-Based Hybrid Approach for Bioinformatics

computer.org

As Edward Keedwell and Ajit Narayana 2 point out, hybrid approaches are useful for solving variou... more

A dynamic wavelet-based algorithm for pre-processing tandem mass spectrometry data

Bioinformatics, Jan 1, 2010

Mass spectrometry (MS)-based proteomics is one of the most commonly used research techniques for ... more Mass spectrometry (MS)-based proteomics is one of the most commonly used research techniques for identifying and characterizing proteins in biological and medical research. The identification of a protein is the critical first step in elucidating its biological function. Successful protein identification depends on various interrelated factors, including effective analysis of MS data generated in a proteomic experiment. This analysis comprises several stages, often combined in a pipeline or workflow. The first component of the analysis is known as spectra pre-processing. In this component, the raw data generated by the mass spectrometer is processed to eliminate noise and identify the mass-to-charge ratio (m/z) and intensity for the peaks in the spectrum corresponding to the presence of certain peptides or peptide fragments. Since all downstream analyses depend on the pre-processed data, effective pre-processing is critical to protein identification and characterization. There is a critical need for more robust pre-processing algorithms that perform well on tandem mass spectra under a variety of different conditions and can be easily integrated into sophisticated data analysis pipelines for practical wet-lab applications. We have developed a new pre-processing algorithm. Based on wavelet theory, our method uses a dynamic peak model to identify peaks. It is designed to be easily integrated into a complete proteomic analysis workflow. We compared the method with other available algorithms using a reference library of raw MS and tandem MS spectra with known protein composition information. Our pre-processing algorithm results in the identification of significantly more peptides and proteins in the downstream analysis for a given false discovery rate. Software available at: http://www.maths.usyd.edu.au/u/penghao/index.html.

Genetic algorithm-based multi-objective optimisation for QoS-aware web services composition

Knowledge Science, Engineering …, Jan 1, 2011

Finding an optimal solution for QoS-aware Web service composition with various restrictions on qualities is a multi-objective optimisation problem. A popular multi-objective genetic algorithm, NSGA-II, is studied in order to provide a set of optimal solutions for QoS-based service composition. Experiments with different numbers of abstract and concrete services confirm the expected behaviour of the algorithm.
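As a rough illustration of what "a set of optimal solutions" means in a multi-objective setting, the sketch below filters candidate compositions down to their Pareto (non-dominated) front, the notion of optimality that NSGA-II is built on. It is not the paper's implementation and not a full NSGA-II (no crowding distance, crossover, or mutation). The two objectives (cost and response time, both minimised) and the candidate values are assumptions made up for the example.

```python
from typing import List, Tuple

# Each candidate composition is scored as (cost, response_time).
# Both objectives are hypothetical and both are minimised.
Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative (made-up) QoS scores for five candidate compositions.
candidates = [
    (10.0, 200.0),
    (12.0, 150.0),
    (15.0, 150.0),  # dominated by (12.0, 150.0)
    (9.0, 300.0),
    (12.0, 250.0),  # dominated by (10.0, 200.0)
]
front = non_dominated_front(candidates)
print(front)
```

The three surviving candidates each represent a different trade-off between the two objectives; none can be improved on one objective without getting worse on the other.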

A particle swarm based hybrid system for imbalanced medical data sampling

BMC genomics, Jan 1, 2009

Background: Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating the class imbalance, and then combined with the minority class to form a balanced dataset.
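The final step described above (rank majority-class samples, then combine the best of them with the minority class) can be sketched as follows. This is only a minimal illustration: the merit scores are taken as given, whereas in the paper they come from the PSO-driven evaluation with multiple classifiers and metrics, which is not reproduced here. The function name `balance_by_rank` and all sample values are hypothetical.

```python
from typing import List, Sequence


def balance_by_rank(
    majority: Sequence[str],
    minority: Sequence[str],
    merit: Sequence[float],
) -> List[str]:
    """Keep the highest-merit majority samples so both classes have equal size.

    `merit` holds one (assumed, precomputed) score per majority sample;
    higher means more useful for compensating the class imbalance.
    """
    if len(majority) != len(merit):
        raise ValueError("need exactly one merit score per majority sample")
    ranked = sorted(zip(majority, merit), key=lambda pair: pair[1], reverse=True)
    kept = [sample for sample, _ in ranked[: len(minority)]]
    return kept + list(minority)


# Made-up example: five majority samples, two minority samples.
majority_ids = ["m1", "m2", "m3", "m4", "m5"]
merit_scores = [0.2, 0.9, 0.5, 0.7, 0.1]
minority_ids = ["p1", "p2"]

balanced = balance_by_rank(majority_ids, minority_ids, merit_scores)
print(balanced)
```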

Hierarchical kernel mixture models for the prediction of AIDS disease progression using HIV structural gp120 profiles

BMC …, Jan 1, 2010

Changes to the glycosylation profile on HIV gp120 can influence viral pathogenesis and alter AIDS disease progression. The characterization of glycosylation differences at the sequence level is inadequate as the placement of carbohydrates is structurally complex. However, no structural framework is available to date for the study of HIV disease progression. In this study, we propose a novel machine-learning based framework for the prediction of AIDS disease progression in three stages (RP, SP, and LTNP) using the HIV structural gp120 profile. This new intelligent framework proves to be accurate and provides an important benchmark for predicting AIDS disease progression computationally. The model is trained using a novel HIV gp120 glycosylation structural profile to detect possible stages of AIDS disease progression for the target sequences of HIV+ individuals. The performance of the proposed model was compared to seven existing machine-learning models on the newly proposed gp120-Benchmark_1 dataset in terms of error rate (MSE), accuracy (CCI), stability (STD), and complexity (TBM). The novel framework showed better predictive performance with 67.82% CCI, 30.21 MSE, 0.8 STD, and 2.62 TBM on the three stages of AIDS disease progression of 50 HIV+ individuals. This framework is an invaluable bioinformatics tool that will be useful for the clinical assessment of viral pathogenesis.

Multiagent framework for bio-data mining

Rough Sets and Knowledge Technology, Jan 1, 2009

This paper proposes to apply multiagent based data mining technologies to biological data analysis. The rationale is justified from multiple perspectives with an emphasis on biological context. Following that, an initial multiagent based bio-data mining framework is presented. Based on the framework, we developed a prototype system to demonstrate how it helps biologists perform a comprehensive mining task for answering biological questions. The system offers a new way to reuse biological datasets and available data mining algorithms with ease.

A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data

Neurocomputing, Jan 1, 2010

A clustering based hybrid system for mass spectrometry data analysis

Pattern Recognition in Bioinformatics, Jan 1, 2008

Recently, much attention has been given to mass spectrometry (MS)-based disease classification, diagnosis, and protein-based biomarker identification. As with microarray-based investigation, proteomic data generated by such high-throughput experiments often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded with data noise, redundancy, and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of such data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high dimensional data. The proposed method uses a feature extraction and selection procedure based on the k-means clustering algorithm to bridge the filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers among those generated by the k-means clustering algorithm. Experimental results obtained by using the proposed method indicate that it is suitable for m/z biomarker selection and MS-based sample classification.

A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data

BMC bioinformatics, Jan 1, 2010

Background: Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists focus on the selected genes to further validate their biological hypotheses.

An ensemble of classifiers with genetic algorithm based feature selection

IEEE Intelligent Informatics Bulletin, Jan 1, 2008

Different data classification algorithms have been developed and applied in various areas to analyze and extract valuable information and patterns from large datasets with noise and missing values. However, none of them performs consistently well over all datasets. To this end, ensemble methods have been suggested as a promising approach. This paper proposes a novel hybrid algorithm, which is the combination of a multi-objective Genetic Algorithm (GA) and an ensemble classifier. While the ensemble classifier, which consists of a decision tree classifier, an Artificial Neural Network (ANN) classifier, and a Support Vector Machine (SVM) classifier, is used as the classification committee, the multi-objective Genetic Algorithm is employed as the feature selector to help the ensemble classifier improve the overall sample classification accuracy while also identifying the most important features in the dataset of interest. The proposed GA-Ensemble method is tested on three benchmark datasets, and compared with each individual classifier as well as the methods based on mutual information theory, bagging and boosting. The results suggest that this GA-Ensemble method outperforms the other algorithms in the comparison and is a useful method for classification and feature selection problems.
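To make the idea of a "classification committee" concrete, the sketch below combines the label predictions of three classifiers by plain majority vote. The abstract does not state how the committee's votes are combined, so majority voting is an assumption here, and the prediction lists are made up rather than produced by real decision tree, ANN, or SVM models.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier predictions into one label per sample.

    `predictions[i][j]` is the label classifier i assigns to sample j.
    Ties are resolved in favour of the label seen first among the voters.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical outputs of the three committee members on three samples.
tree_pred = ["a", "b", "a"]
ann_pred = ["a", "a", "b"]
svm_pred = ["b", "b", "b"]

committee = majority_vote([tree_pred, ann_pred, svm_pred])
print(committee)
```

With three voters and two classes a tie cannot occur, so the tie-breaking rule only matters for larger label sets or an even number of classifiers.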

An agent-based hybrid system for microarray data analysis

Intelligent Systems, IEEE, Jan 1, 2009

Hybrid methods to select informative gene sets in microarray data classification

Proceedings of the 20th Australian joint …, Jan 1, 2007

One of the key applications of microarray studies is to select and classify gene expression profiles of cancer and normal subjects. In this study, two hybrid approaches, genetic algorithm with decision tree (GADT) and genetic algorithm with neural network (GANN), are utilized to select optimal gene sets which contribute to the highest classification accuracy. Two benchmark microarray datasets were tested, and the most significant disease related genes have been identified. Furthermore, the selected gene sets achieved comparably high sample classification accuracy (96.79% and 94.92% in the colon cancer dataset, 98.67% and 98.05% in the leukemia dataset) compared with those obtained by the mRMR algorithm. The study results indicate that these two hybrid methods are able to select disease related genes and improve classification accuracy.