ICT513 Group Project


ICT 513

DATA ANALYTICS

GROUP PROJECT
REPORT
ON
Analysis of Breast Cancer Prognosis

Student (Student No.)               % Contribution    Signature
Rabjot Kaur (34517096)              50%               Rabjot Kaur
Muhammad Hammad Anwar (34548635)    50%               Muhammad Hammad Anwar
Total                               100%
1. Introduction:
Breast cancer is a prevalent and difficult-to-treat cancer that impacts 1 out of every 7 women
in Australia. Advancements in detection and treatment have increased survival rates recently,
but individuals with triple-negative breast cancer have a five-year survival rate of just 77%.
This study explores the possibility of forecasting patient outcomes using clinical,
pathological, and gene expression information to pinpoint aggressive cases and offer
customized treatment strategies.
This analysis is guided by three main questions:
(a) Gene expression data can be expensive to collect. In the non-research world, this cost
must be borne by the patient. How well can common clinico-pathological variables help to
classify an individual’s chance of being distant metastasis free after 5 years?
(b) Do the data show a clear link between breast cancer survival and hormone receptor status
(ER/PR/HER2)?
(c) Does combining the traditional variables with the gene expression data improve the
ability to predict patient prognosis?

2. Methods and Analysis:


2.1 Notation and Subsetting:
In this report, the dataset's variable names are marked with italics for better understanding.
E02_Event_DMFS_2005 is known as DMFS Event, while Cl03_Size_.mm is identified as
Tumor Size.
Data cleaning included filling in missing values through imputation, utilizing the median for
numerical data and mode for categorical data. Once all missing values were confirmed to be
filled in, the dataset was divided into subsets to address certain analysis questions, such as
clinico-pathological factors by themselves, hormone receptor status, and comprehensive data
incorporating gene expression elements.
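
The imputation rule described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report; the function name impute and the toy columns are hypothetical and are not taken from the study dataset.

```python
from statistics import median, mode

def impute(column, kind):
    """Fill None entries with the median (numeric) or the mode (categorical)."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical toy columns, not values from the study dataset.
tumor_size = [20, None, 35, 15, None, 25]
tumor_grade = ["2", "3", None, "3", "1", "3"]

size_filled = impute(tumor_size, "numeric")
grade_filled = impute(tumor_grade, "categorical")
print(size_filled)
print(grade_filled)
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean, which matters for skewed measurements such as tumor size.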

The subsets were defined as follows, one per research question, with clinico-pathological
and gene expression data handled differently according to their modeling needs.
Question (a)- Clinico-Pathological Data: We chose pertinent factors like Age, Tumor Size,
Lymph Node Positivity, and Tumor Grade to analyze the prognosis of metastasis-free survival
(DMFS) based on clinico-pathological variables.
Question (b)- Hormone Receptor Data: To study the correlation between hormone receptor
status (ER, PR, HER2) and survival, a specific group of variables centered around these
hormone receptors was utilized.
Question (c)- Combined Data: To assess whether prognostic prediction improves, a
comprehensive dataset integrating clinico-pathological factors and gene expression
information was used.

The analysis was carried out in R, using packages such as MASS for linear discriminant
analysis (LDA), ggplot2 for visualization, caret for data handling, rpart for decision trees and
dplyr for data manipulation. This systematic approach helped keep variable selection
consistent and the analyses reproducible.

2.2 Overview of Data


The breast cancer analysis dataset consists of 295 patient records from the Netherlands
Cancer Institute, including clinico-pathological and gene expression information. The
clinico-pathological variables include Age, Tumor Size, Lymph Node Positivity, Tumor Grade,
and hormone receptor status (ER, PR, HER2), together with the distant metastasis-free
survival (DMFS) outcome. The dataset also contains 105 gene expression variables, offering
molecular detail on each patient's tumor profile.
Clinico-Pathological Analysis: LDA was used on this subset to evaluate classification
effectiveness with conventional clinical factors.
Hormone Receptor Analysis: LDA was used on this simplified dataset to assess the
predictive power of hormone receptor status for DMFS outcomes.
Combined analysis with gene expression data: PCA was first applied to the gene expression
data to reduce it to the principal components that together account for 80% of the variance.
These components were then merged with the clinico-pathological data. The combined
dataset was analyzed using both LDA (with PCA) and Decision Tree models to assess the
potential of integrating traditional and molecular data for prognosis.

2.2.1 Method for part (a)

The chosen model for predicting the probability of being free of distant metastasis after five
years, using only clinico-pathological variables, is Linear Discriminant Analysis (LDA), as
follows:

E02_Event_DMFS_2005 ∼ Cl01_Age + Cl02_pN_pos + Cl03_Size_.mm. + H02A_Type2007
    + H03_ANGIOINV2001 + H04_centralcoll + H05_matrix + H06_necrosis
    + H08D_grade_07 + H09A_lymphinf2005

LDA was selected for its capability to categorize results by using linear combinations of
predictor variables, with the goal of increasing the distinction between predefined groups,
specifically patients with and without distant metastasis.

Prior probabilities for each class were set from the class distribution in the dataset:
69.49% metastasis-free and 30.51% metastasis-present. The model produced group means and
linear discriminant coefficients for each variable, offering insight into their role in
differentiating the two groups.
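
For reference, the role of the priors can be seen in the textbook two-class LDA rule, written here under the usual assumption of a covariance matrix shared by both classes. The symbols x (a patient's predictor vector), \mu_k (class mean), \Sigma (pooled covariance) and \pi_k (class prior) are notation introduced for this note and do not appear in the report.

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
            - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
            + \log \pi_k,
\qquad k \in \{0, 1\},
\qquad \pi_0 = 0.6949,\quad \pi_1 = 0.3051 .
```

A patient is assigned to the class with the larger score \delta_k(x). Because \log \pi_0 > \log \pi_1, the decision boundary is shifted in favor of the metastasis-free class, so a patient is labeled metastasis-present only when the predictors give enough evidence to outweigh the prior.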
Group means were calculated post-model fitting to compare characteristics between
metastasis-free and metastasis-present groups. Metastasis-present patients had higher Size
and Grade values. The LD1 coefficients showed H08D_grade_07 (Tumor Grade) had a
strong positive coefficient (1.8), indicating a significant association with the metastasis-
present class.
Figure 1: Stacked histograms of the linear discriminant scores for metastasis-free (top) and
metastasis-present (bottom) groups.

Moreover, histograms (Figure 1) were created for the linear discriminants within each group
to visually assess the effectiveness of the LDA in achieving separation. Group 0 has scores
tightly centered around 0, while group 1 has a slightly wider spread. The overlap suggests
potential misclassification areas, indicating that clinical variables may not fully distinguish
metastasis outcomes accurately.
The predictions were evaluated with a confusion matrix, from which the hit and
misclassification rates were calculated to assess the model's performance.

2.2.2 Method for part (b)


A Linear Discriminant Analysis (LDA) model was used to investigate the relationship
between breast cancer survival and hormone receptor status. ER Status, PR Status, and HER2
Status were chosen as predictors for the target variable E02_Event_DMFS_2005 (DMFS
Event), which indicates if a patient remained free of distant metastasis after five years.
The representation of the model is as follows:

E02_Event_DMFS_2005∼ I01_ER_2007 + I02_PR_2007 + I03_HER2_2007

The model was fitted using the MASS package in R, with prior probabilities set to the class
distribution (69.49% for metastasis-free and 30.51% for metastasis-present) to account for
the class imbalance.

Group means were computed for each hormone receptor variable, revealing differences
between the two groups. For example, the metastasis-present group showed a higher mean
HER2 Status, indicating a potential connection to metastasis. The coefficients of the linear
discriminant (LD1) show the impact of each variable on group separation, with HER2 Status
having the highest positive coefficient (83.09%), indicating a strong association with the
metastasis-present class.

The model was utilized to predict discriminant scores, and histograms were created for each
group. The histograms depicted in Figure 2 illustrate how scores are spread out between the
two classes. There is a clear distinction, yet there are some instances of overlap indicating
potential misclassification areas for the model.
Figure 2: Stacked histogram of LD1 of hormone receptor status against distant metastasis (0 and 1).

To further assess the model's performance, a confusion matrix was created using the initial
linear discriminant function (LD1), and the accuracy and error rates were calculated. These
metrics offer a numerical evaluation of how well the model can differentiate between
outcomes with and without metastasis, depending on hormone receptor status.

2.2.3 Method for part (c)

(i) Linear Discriminant Analysis (LDA) with PCA:

The integrated model for predicting distant metastasis-free survival (DMFS Event) combines
conventional clinico-pathological factors with principal components extracted from gene
expression data. The formula of the model is:

E02_Event_DMFS_2005 ∼ Cl01_Age + Cl02_pN_pos + Cl03_Size_.mm. + H02A_Type2007
    + H03_ANGIOINV2001 + H04_centralcoll + H05_matrix + H06_necrosis
    + H08D_grade_07 + H09A_lymphinf2005
    + PC1 + PC2 + PC3 + … + PC39

Initially, PCA was employed to decrease the dimension of the gene expression data, and then
the chosen principal components were merged with clinical and pathological variables in the
LDA model to assess the accuracy of classification.
Figure 3: Stacked histogram of first linear discriminant of clinico-pathological variables and gene
expression data against distant metastasis (0 and 1).

The stacked histogram (Figure 3) shows the distribution of linear discriminant scores for the
two groups: metastasis-free (group 0) and metastasis-present (group 1). The scores center
around 0 and the groups are partly separated, but they still overlap, indicating potential
misclassification despite the improved discriminatory power of the combined data.

The scree plot (Figure 4) displays the variance explained by each principal component of the
gene expression data. It shows that the first few components capture a substantial share of
the variance, which guided the selection of components for the model.

Figure 4: Scree plot of variance accounted by each PC.
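
The selection rule behind the scree plot, keeping the leading components until 80% of the variance is covered, can be sketched as follows. This is an illustrative Python sketch, not the report's R code; the function name components_for_variance and the variance ratios are hypothetical. In the report the same rule led to 39 components from the 105 gene expression variables.

```python
from itertools import accumulate

def components_for_variance(explained_ratio, threshold=0.80):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches the threshold."""
    for k, cumulative in enumerate(accumulate(explained_ratio), start=1):
        # Small tolerance so rounding error does not skip an exact match.
        if cumulative >= threshold - 1e-12:
            return k
    raise ValueError("threshold not reached; check that the ratios sum to 1")

# Hypothetical explained-variance ratios, sorted as PCA returns them.
ratios = [0.40, 0.25, 0.10, 0.08, 0.07, 0.06, 0.04]
n_keep = components_for_variance(ratios)
print(n_keep)
```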

(ii) Decision tree:

A decision tree model was built using a combined dataset to assess whether integrating
clinical and pathological factors with gene expression data improves patient prognosis
prediction (DMFS Event). The model combined clinical and pathological factors with
principal components from gene expression data to summarize genetic information more
effectively. The rpart package in R was used to construct the decision tree, with the target
variable focusing on distant metastasis-free survival.

Figure 5: Initial and Pruned Decision tree for combined data to predict distant metastasis.

At first, a complete decision tree was created to investigate how traditional clinical factors
and gene expression data interacted (Figure 5). The tree was pruned using the optimal
complexity parameter determined through cross-validation to enhance model interpretability
and prevent overfitting. This process of pruning simplified the model by eliminating less
important branches while keeping key predictors intact. The accuracy of the pruned decision
tree was evaluated through a confusion matrix, from which the hit rate and
misclassification rate were calculated to assess directly whether the merged data enhanced
the model's predictive capability. The results are discussed in Section 3.3.
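
Choosing the complexity parameter by cross-validation amounts to picking the row of the tree's complexity table with the lowest cross-validated error. The sketch below illustrates that rule in Python on a hypothetical table whose columns mirror those rpart reports (CP, nsplit, rel error, xerror, xstd). It is an assumption that the report followed exactly this rule; in R the usual idiom is to take the CP value at which.min of the xerror column of the fitted tree's cptable and pass it to prune().

```python
# Each row: (CP, nsplit, rel_error, xerror, xstd). Values are hypothetical.
cp_table = [
    (0.120, 0, 1.00, 1.00, 0.09),
    (0.050, 2, 0.76, 0.91, 0.08),
    (0.020, 5, 0.61, 0.88, 0.08),
    (0.010, 9, 0.53, 0.95, 0.08),
]

def best_cp(table):
    """Return the CP value of the row with the lowest cross-validated error."""
    return min(table, key=lambda row: row[3])[0]

chosen_cp = best_cp(cp_table)
print(chosen_cp)
```

In the toy table the training error (rel_error) keeps falling as the tree grows, while the cross-validated error rises again after five splits, which is the overfitting that pruning is meant to remove.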

3. Results and discussion:

3.1 Part a
The linear discriminants of the model revealed that Tumor Grade and Lymph Node Positivity
had the largest coefficients, suggesting a substantial contribution to predicting
metastasis outcomes. Predictions on the same dataset used to fit the model had a hit rate of
72.88% and a misclassification rate of 27.12%, a moderate level of accuracy when using
clinico-pathological factors alone.

Figure 1: Hit rate and misclassification rate of LD1 of clinico-pathological variables against distant metastasis.

Confusion matrix
                 Actual 0    Actual 1
Predicted 0        187          62
Predicted 1         18          28

The confusion matrix shows that 187 metastasis-free patients were correctly identified,
while 18 were wrongly predicted to have metastasis. Of the 90 patients with metastasis, only
28 were correctly classified and 62 were mislabeled as metastasis-free. The linear
discriminants' histograms showed a distinction between the two groups with some overlap,
indicating that although clinico-pathological data is helpful for discrimination, better
classification accuracy could be achieved by refining or adding more data types.
These results highlight that clinico-pathological data alone may not be sufficient for
accurate metastasis outcome predictions, suggesting a need for additional data sources or
more complex models.
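
The rates quoted for part (a) can be recomputed directly from the confusion matrix. The Python sketch below does so and also reports the per-class rates, which the report does not state; the names rates, specificity and sensitivity are introduced here for illustration.

```python
# Part (a) confusion matrix: rows are predicted class, columns are actual class.
matrix = [[187, 62],
          [18, 28]]

def rates(m):
    """Return hit rate, misclassification rate, specificity and sensitivity."""
    total = sum(sum(row) for row in m)
    hit = (m[0][0] + m[1][1]) / total
    specificity = m[0][0] / (m[0][0] + m[1][0])  # metastasis-free correctly found
    sensitivity = m[1][1] / (m[0][1] + m[1][1])  # metastasis-present correctly found
    return hit, 1 - hit, specificity, sensitivity

hit, miss, spec, sens = rates(matrix)
print(f"hit {hit:.2%}, misclassification {miss:.2%}")
print(f"specificity {spec:.2%}, sensitivity {sens:.2%}")
```

Splitting the hit rate by class makes the imbalance visible: most of the correct predictions come from the larger metastasis-free group, while the majority of metastasis-present patients are missed.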

3.2 Part b
The model's linear discriminants showed that HER2 status had the highest coefficient,
suggesting an important role in predicting distant metastasis outcomes from hormone
receptor status. Predictions on the same dataset resulted in a 69.49% hit rate and a 30.51%
misclassification rate. This hit rate equals the proportion of metastasis-free patients in
the data, so it reflects the class distribution rather than real discriminatory power.

Figure 2: Hit rate and misclassification rate of hormone receptor status against distant metastasis.

Confusion matrix
                 Actual 0    Actual 1
Predicted 0        205          90
Predicted 1          0           0

The confusion matrix indicates that all 205 metastasis-free patients were correctly
classified, whereas all 90 patients with metastasis were incorrectly labeled as
metastasis-free. In other words, the model assigned every patient to the metastasis-free
class. Graphs of the linear discriminants show a certain level of distinction between the
two categories, but there is also overlap. This indicates that although hormone receptor
information may carry some signal, its value for classification on its own is limited.
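
A short check makes the point about the part (b) hit rate concrete. The Python sketch below compares the model with a trivial rule that always predicts "metastasis-free"; the variable names are introduced here for illustration, and the counts are the column totals of the confusion matrix.

```python
n_free, n_present = 205, 90          # actual class counts (matrix column totals)
total = n_free + n_present

# A rule that always predicts "metastasis-free" is right for every class-0 patient.
baseline_hit = n_free / total
part_b_hit = (205 + 0) / total       # diagonal of the part (b) confusion matrix
sensitivity = 0 / n_present          # no metastasis-present patient was detected

print(f"baseline {baseline_hit:.2%}, part (b) {part_b_hit:.2%}, "
      f"sensitivity {sensitivity:.2%}")
```

The two hit rates coincide, so any model should be judged against this majority-class baseline rather than against 50%.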

3.3 Part c
(i) Linear Discriminant Analysis (LDA) with PCA:
Combining LDA with PCA improved classification accuracy moderately. The scree plot
(Figure 4) reveals that the early components explain a substantial share of the
variance, enabling dimensionality reduction. Using these components together with the
clinico-pathological variables, the model obtained a 75.25% hit rate (222 of 295
patients) with a 24.75% misclassification rate.

Confusion matrix (LDA with PCA)
                 Actual 0    Actual 1
Predicted 0        181          49
Predicted 1         24          41
The model correctly identified 181 metastasis-free cases and 41 metastasis-present cases.
However, 49 metastasis-present cases were misclassified as metastasis-free, and 24
metastasis-free cases were incorrectly labeled as metastasis-present.
(ii) Decision Tree:
Using combined data, an initial decision tree model included clinico-pathological
variables and gene expression principal components. After pruning, the decision
tree was simplified for improved interpretability and reduced overfitting. The
pruned model achieved 82.71% accuracy and 17.29% misclassification rate.

Confusion matrix (Decision Tree)
                 Actual 0    Actual 1
Predicted 0        189          35
Predicted 1         16          55

The pruned decision tree accurately identified 189 cases without metastasis and 55
cases with metastasis. There were fewer misclassifications in comparison to the
LDA model, with 35 cases of metastasis being mistakenly labeled as metastasis-
free and 16 cases of metastasis-free being classified as metastasis-present.

The decision tree results suggest that combining gene expression data with clinical
variables boosts predictive accuracy. The pruned tree simplifies the model without
losing performance and emphasizes Tumor Size and the principal components for
prognosis prediction.
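
To compare the four models on an equal footing, the hit rates can be recomputed from the confusion matrices reported in this section and set against the majority-class baseline (always predicting metastasis-free). The Python sketch below is illustrative; the dictionary labels and the function name hit_rate are introduced here.

```python
# Confusion matrices from the report: rows = predicted, columns = actual.
models = {
    "LDA, clinico-pathological": [[187, 62], [18, 28]],
    "LDA, hormone receptors":    [[205, 90], [0, 0]],
    "LDA with PCA":              [[181, 49], [24, 41]],
    "Pruned decision tree":      [[189, 35], [16, 55]],
}

def hit_rate(m):
    """Share of patients on the diagonal of a 2x2 confusion matrix."""
    return (m[0][0] + m[1][1]) / sum(map(sum, m))

baseline = 205 / 295  # share of metastasis-free patients
summary = {name: hit_rate(m) for name, m in models.items()}
for name, hit in summary.items():
    print(f"{name:28s} hit {hit:.2%}  gain over baseline {hit - baseline:+.2%}")
```

All of these figures are computed on the same patients used to fit the models, so they are likely to be optimistic, especially for the decision tree; a held-out test set or cross-validated predictions would give a fairer comparison.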
4. Conclusion:

The findings of this study indicate that clinico-pathological factors by themselves have only a
moderate ability to predict distant metastasis-free survival and are not very effective at
distinguishing between different metastasis outcomes. Part (a) showed that while Linear
Discriminant Analysis (LDA) can somewhat distinguish between patients with and without
distant metastasis, it may not provide the optimal model.

In part (b), hormone receptor status (ER, PR, HER2) used on its own did not improve on the
clinico-pathological model: its hit rate of 69.49% matched the share of metastasis-free
patients, because every patient was assigned to that class. HER2 status carried the largest
discriminant coefficient, but relying solely on hormone receptor data proved inadequate for
achieving precise predictions.

Part (c) demonstrated that integrating clinico-pathological factors with gene expression
information improves the accuracy of predictions. The LDA with PCA model gave a modest
gain over the clinico-pathological model, and the pruned decision tree achieved the highest
hit rate (82.71%). This highlights the value of gene expression data for prognosis
prediction, in addition to the clinico-pathological and hormone receptor data, which offer
fundamental insights.
