C2M4 - Assignment: 1 Cox Proportional Hazards and Random Survival Forests
C2M4 - Assignment: 1 Cox Proportional Hazards and Random Survival Forests
C2M4 - Assignment: 1 Cox Proportional Hazards and Random Survival Forests
Welcome to the final assignment in Course 2! In this assignment you’ll develop risk models using
survival data and a combination of linear and non-linear techniques. We’ll be using a dataset with
survival data of patients with Primary Biliary Cirrhosis (pbc). PBC is a progressive disease of the
liver caused by a buildup of bile within the liver (cholestasis) that results in damage to the small
bile ducts that drain bile from the liver. Our goal will be to understand the effects of different
factors on the survival times of the patients. Along the way you’ll learn about the following topics:
• Cox Proportional Hazards
– Data Preprocessing for Cox Models.
• Random Survival Forests
– Permutation Methods for Interpretation.
1.1 Outline
• Section ??
• Section ??
• Section ??
• Section ??
– Section ??
• Section ??
• Section ??
– Section ??
• Section ??
– Section ??
• Section ??
• Section ??
## 1. Import Packages
We’ll first import all the packages that we need for this assignment.
• sklearn is one of the most popular machine learning libraries.
• numpy is the fundamental package for scientific computing in python.
• pandas is what we’ll use to manipulate our data.
• matplotlib is a plotting library.
• lifelines is an open-source survival analysis library.
1
[22]: import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
(258, 19)
[24]: time status trt age sex ascites hepato spiders edema \
0 1.095890 1.0 0.0 58.765229 0.0 1.0 1.0 1.0 1.0
1 12.328767 0.0 0.0 56.446270 0.0 0.0 1.0 1.0 0.0
2 2.772603 1.0 0.0 70.072553 1.0 0.0 0.0 0.0 0.5
3 5.273973 1.0 0.0 54.740589 0.0 0.0 1.0 1.0 0.5
6 5.019178 0.0 1.0 55.534565 0.0 0.0 1.0 0.0 0.0
stage
0 4.0
1 3.0
2
2 4.0
3 4.0
6 3.0
Now, split your dataset into train, validation and test set using 60/20/20 split.
[26]: np.random.seed(0)
df_dev, df_test = train_test_split(df, test_size = 0.2)
df_train, df_val = train_test_split(df_dev, test_size = 0.25)
3
[27]: continuous_columns = ['age', 'bili', 'chol', 'albumin', 'copper', 'alk.phos',␣
,→'ast', 'trig', 'platelet', 'protime']
Let’s check the summary statistics on our training dataset to make sure it’s standardized.
[28]: df_train.loc[:, continuous_columns].describe()
TX
λ(t, x) = λ0 (t)eθ i
The λ0 term is a baseline hazard and incorporates the risk over time, and the other term incorporates
the risk due to the individual’s covariates. After fitting the model, we can rank individuals using
T
the person-dependent risk term eθ Xi .
4
Categorical variables cannot be used in a regression model as they are. In order to use them,
conversion to a series of variables is required.
Since our data has a mix of categorical (stage) and continuous (wblc) variables, before we proceed
further we need to do some data engineering. To tackle the issue at hand we’ll be using the Dummy
Coding technique. In order to use Cox Proportional Hazards, we will have to turn the categorical
data into one hot features so that we can fit our Cox model. Luckily, Pandas has a built-in function
called get_dummies that will make it easier for us to implement our function. It turns categorical
features into multiple binary features.
### Exercise 1 In the cell below, implement the to_one_hot(...) function.
Hints
Remember to drop the first dummy for each each category to avoid convergence issues when fitting
the proportional hazards model.
Check out the get_dummies() documentation.
Use dtype=np.float64.
[29]: # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def to_one_hot(dataframe, columns):
'''
Convert columns in dataframe to one-hot encoding.
Args:
dataframe (dataframe): pandas dataframe containing covariates
columns (list of strings): list categorical column names to one hot␣
,→encode
Returns:
one_hot_df (dataframe): dataframe with categorical columns encoded
as binary variables
'''
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
return one_hot_df
Now we’ll use the function you coded to transform the training, validation, and test sets.
[30]: # List of categorical columns
to_encode = ['edema', 'stage']
5
print(one_hot_val.columns.tolist())
print(f"There are {len(one_hot_val.columns)} columns")
Expected output
['time', 'status', 'trt', 'age', 'sex', 'ascites', 'hepato', 'spiders', 'bili', 'chol', 'albumi
There are 22 columns
Now, let’s take a peek at one of the transformed data sets. Do you notice any new features?
[31]: print(one_hot_train.shape)
one_hot_train.head()
(154, 22)
[5 rows x 22 columns]
6
Run the following cell to fit your Cox Proportional Hazards model using the lifelines package.
[32]: cph = CoxPHFitter()
cph.fit(one_hot_train, duration_col = 'time', event_col = 'status', step_size=0.
,→1)
You can use cph.print_summary() to view the coefficients associated with each covariate as well
as confidence intervals.
[33]: cph.print_summary()
<IPython.core.display.HTML object>
Question:
• According to the model, was treatment trt beneficial?
• What was its associated hazard ratio?
– Note that the hazard ratio is how much an incremental increase in the feature variable
changes the hazard.
Check your answer!
You should see that the treatment (trt) was beneficial because it has a negative impact on the
hazard (the coefficient is negative, and exp(coef) is less than 1).
The associated hazard ratio is ~0.8, because this is the exp(coef) of treatment.
We can compare the predicted survival curves for treatment variables. Run the next cell to plot
survival curves using the plot_covariate_groups() function. - The y-axis is th survival rate -
The x-axis is time
[34]: cph.plot_covariate_groups('trt', values=[0, 1]);
7
Notice how the group without treatment has a lower survival rate at all times (the x-axis is time)
compared to the treatment group.
## 6. Hazard Ratio
Recall from the lecture videos that the Hazard Ratio between two patients was the likelihood of
one patient (e.g smoker) being more at risk than the other (e.g non-smoker).
λsmoker (t)
= eθ(Xsmoker −Xnonsmoker )
T
λnonsmoker (t)
Where
T
λsmoker (t) = λ0 (t)eθXsmoker
and
T
λnonsmoker (t) = λ0 (t)eθXnonsmoker
### Exercise 2 In the cell below, write a function to compute the hazard ratio between two
individuals given the cox model’s coefficients.
Hints
use numpy.dot
use nump.exp
[35]: # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def hazard_ratio(case_1, case_2, cox_params):
'''
8
Return the hazard ratio of case_1 : case_2 using
the coefficients of the cox model.
Args:
case_1 (np.array): (1 x d) array of covariates
case_2 (np.array): (1 x d) array of covariates
model (np.array): (1 x d) array of cox model coefficients
Returns:
hazard_ratio (float): hazard ratio of case_1 : case_2
'''
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
hr = np.exp(cox_params.dot((case_1 - case_2).T))
return hr
j = 5
case_2 = one_hot_train.iloc[j, :].drop(['time', 'status'])
15.029017732492221
Expected Output:
15.029017732492221
Question:
Is case_1 or case_2 at greater risk?
Check your answer!
Important! The following answer only applies if you picked i = 1 and j = 5
You should see that case_1 is at higher risk.
The hazard ratio of case 1 / case 2 is greater than 1, so case 1 had a higher hazard relative to case
2
Inspect different pairs, and see if you can figure out which patient is more at risk.
9
[37]: i = 4
case_1 = one_hot_train.iloc[i, :].drop(['time', 'status'])
j = 7
case_2 = one_hot_train.iloc[j, :].drop(['time', 'status'])
Case 1
trt 0.000000
age 0.563645
sex 0.000000
ascites 0.000000
hepato 1.000000
spiders 1.000000
bili -0.405645
chol -0.210436
albumin 1.514297
copper -0.481961
alk.phos 2.173526
ast -0.144699
trig -0.531125
platelet -0.450972
protime -0.139881
edema_0.5 0.000000
edema_1.0 0.000000
stage_2.0 0.000000
stage_3.0 1.000000
stage_4.0 0.000000
Name: 1, dtype: float64
Case 2
trt 0.000000
age 0.463447
sex 0.000000
ascites 0.000000
hepato 1.000000
spiders 0.000000
bili -0.489581
chol -0.309875
albumin -1.232371
copper -0.504348
10
alk.phos 2.870427
ast -0.936261
trig -0.150229
platelet 3.190823
protime -0.139881
edema_0.5 0.000000
edema_1.0 0.000000
stage_2.0 0.000000
stage_3.0 0.000000
stage_4.0 1.000000
Name: 38, dtype: float64
Scenario 1
• A was censored at time tA
• B died at tB
• tA < tB .
Because of censoring, we can’t say whether A or B should have a higher risk score.
11
Note that in this case, being censored at time t means that the true death time was some time
AFTER time t and not at t. - Therefore if tA = tB and A was censored: - Then A actually lived
longer than B. - This will effect how you deal with ties in the exercise below!
### Exercise 3 Fill in the function below to compute Harrel’s C-index.
Hints
If you get a division by zero error, consider checking how you count when a pair is permissible (in
the case where one patient is censored and the other is not censored).
Args:
y_true (array): array of true event times
scores (array): model risk scores
event (array): indicator, 1 if event occurred at that index, 0 for␣
,→censorship
Returns:
result (float): C-index metric
'''
n = len(y_true)
assert (len(scores) == n and len(event) == n)
concordant = 0.0
permissible = 0.0
ties = 0.0
result = 0.0
### START CODE HERE (REPLACE INSTANCES OF 'None' and 'pass' with your code)␣
###
,→
12
# check if scores are tied
if scores[i] == scores[j]:
ties += 1.0
if event[i] == 0:
censored = i
uncensored = j
# check if permissible
# Note: in this case, we are assuming that censored at a␣
time
,→
# means that you did NOT die at that time. That is, if you
# live until time 30 and have event = 0, then you lived␣
THROUGH
,→
# time 30.
if y_true[uncensored] <= y_true[censored]:
permissible += 1.0
13
return result
# Case 1
event = [1, 1, 1, 1]
scores = [0.5, 0.9, 0.1, 1.0]
print("Case 1")
print("Expected: 1.0, Output: {}".format(harrell_c(y_true, scores, event)))
# Case 2
scores = [0.9, 0.5, 1.0, 0.1]
print("\nCase 2")
print("Expected: 0.0, Output: {}".format(harrell_c(y_true, scores, event)))
# Case 3
event = [1, 0, 1, 1]
scores = [0.5, 0.9, 0.1, 1.0]
print("\nCase 3")
print("Expected: 1.0, Output: {}".format(harrell_c(y_true, scores, event)))
# Case 4
y_true = [30, 30, 20, 20]
event = [1, 0, 1, 0]
scores = [10, 5, 15, 20]
print("\nCase 4")
print("Expected: 0.75, Output: {}".format(harrell_c(y_true, scores, event)))
# Case 5
y_true = list(reversed([30, 30, 30, 20, 20]))
event = [0, 1, 0, 1, 0]
scores = list(reversed([15, 10, 5, 15, 20]))
print("\nCase 5")
print("Expected: 0.583, Output: {}".format(harrell_c(y_true, scores, event)))
# Case 6
y_true = [10,10]
event = [0,1]
scores = [4,5]
print("\nCase 6")
print(f"Expected: 1.0 , Output:{harrell_c(y_true, scores, event):.4f}")
Case 1
Expected: 1.0, Output: 1.0
14
Case 2
Expected: 0.0, Output: 0.0
Case 3
Expected: 1.0, Output: 1.0
Case 4
Expected: 0.75, Output: 0.75
Case 5
Expected: 0.583, Output: 0.5833333333333334
Case 6
Expected: 1.0 , Output:1.0000
Now use the Harrell’s C-index function to evaluate the cox model on our data sets.
[40]: # Train
scores = cph.predict_partial_hazard(one_hot_train)
cox_train_scores = harrell_c(one_hot_train['time'].values, scores.values,␣
,→one_hot_train['status'].values)
# Validation
scores = cph.predict_partial_hazard(one_hot_val)
cox_val_scores = harrell_c(one_hot_val['time'].values, scores.values,␣
,→one_hot_val['status'].values)
# Test
scores = cph.predict_partial_hazard(one_hot_test)
cox_test_scores = harrell_c(one_hot_test['time'].values, scores.values,␣
,→one_hot_test['status'].values)
print("Train:", cox_train_scores)
print("Val:", cox_val_scores)
print("Test:", cox_test_scores)
Train: 0.8265139116202946
Val: 0.8544776119402985
Test: 0.8478543563068921
What do these values tell us ?
## 8. Random Survival Forests
This performed well, but you have a hunch you can squeeze out better performance by using a
machine learning approach. You decide to use a Random Survival Forest. To do this, you can use
the RandomForestSRC package in R. To call R function from Python, we’ll use the r2py package.
Run the following cell to import the necessary requirements.
[41]: %load_ext rpy2.ipython
%R require(ggplot2)
15
from rpy2.robjects.packages import importr
# import R's "base" package
base = importr('base')
[43]: print(model)
Finally, let’s evaluate on our validation and test sets, and compare it with our Cox model.
16
[44]: result = R.predict(model, newdata=df_val)
scores = np.array(result.rx('predicted')[0])
y = np.arange(len(vimps))
plt.barh(y, np.abs(vimps))
plt.yticks(y, df_train.drop(['time', 'status'], axis=1).columns)
plt.title("VIMP (absolute value)")
plt.show()
17
1.1.2 Question:
How does the variable importance compare to that of the Cox model? Which variable is important
in both models? Which variable is important in the random survival forest but not in the Cox
model? You should see that edema is important in both the random survival forest and the Cox
model. You should also see that bili is important in the random survival forest but not the Cox
model .
2 Congratulations!
You’ve finished the last assignment in course 2! Take a minute to look back at the analysis you’ve
done over the last four assignments. You’ve done a great job!
18