Predicting Cervical Cancer Using Machine Learning: Project Report Submitted To Raghupathi Cavale
Bachelor of Technology
in
This dataset focuses on predicting indicators and diagnoses of cervical cancer. It was
collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela, and comprises
demographic information, habits, and historic medical records of 858 patients. Several
patients declined to answer some of the questions because of privacy concerns, so the data
contains missing values.
About 11,000 new cases of invasive cervical cancer are diagnosed each year in the U.S.
However, the number of new cervical cancer cases has been declining steadily over the past
decades. Although it is the most preventable type of cancer, each year cervical cancer kills
about 4,000 women in the U.S. and about 300,000 women worldwide. In the United States,
cervical cancer mortality rates plunged by 74% from 1955 - 1992 thanks to increased
screening and early detection with the Pap test.
AGE: Fifty percent of cervical cancer diagnoses
occur in women ages 35 - 54, and about 20% occur in women over 65 years of age. The median
age of diagnosis is 48 years. About 15% of cases are diagnosed in women between the ages
of 20 - 30. Cervical cancer is extremely rare in women younger than age 20. However, many
young women become infected with multiple types of human papilloma virus, which then can
increase their risk of getting cervical cancer in the future. Young women with early abnormal
changes who do not have regular examinations are at high risk for localized cancer by the
time they are age 40, and for invasive cancer by age 50.
SOCIOECONOMIC AND ETHNIC FACTORS: Although the rate of cervical cancer has declined
among both Caucasian and African-American women over the past decades, it remains much
more prevalent in African-Americans, whose death rates are twice as high as those of Caucasian
women. Hispanic American women have more than twice the risk of invasive cervical cancer
of Caucasian women, also due to a lower rate of screening. These differences, however, are
almost certainly due to social and economic differences. Numerous studies report that high
poverty levels are linked with low screening rates. In addition, lack of health insurance, limited
transportation, and language difficulties hinder a poor woman’s access to screening services.
HIGH SEXUAL ACTIVITY: Human papilloma virus (HPV) is the main risk factor for cervical
cancer. In adults, the most important risk factor for HPV is sexual activity with an infected
person. Women most at risk for cervical cancer are those with a history of multiple sexual
partners, sexual intercourse at age 17 years or younger, or both. A woman who has never
been sexually active has a very low risk for developing cervical cancer. Sexual activity with
multiple partners increases the likelihood of many other sexually transmitted infections
(chlamydia, gonorrhea, syphilis). Studies have found an association between chlamydia and
cervical cancer risk, including the possibility that chlamydia may prolong HPV infection.
FAMILY HISTORY: Women have a higher risk of cervical cancer if they have a first-degree
relative (mother, sister) who has had cervical cancer.
USE OF ORAL CONTRACEPTIVES: Studies have reported a strong association between cervical
cancer and long-term use of oral contraception (OC). Women who take birth control pills for
more than 5 - 10 years appear to have a much higher risk of HPV infection (up to four times
higher) than those who do not use OCs. (Women taking OCs for fewer than 5 years do not
have a significantly higher risk.) The reasons for this risk from OC use are not entirely clear.
Women who use OCs may be less likely to use a diaphragm, condoms, or other methods that
offer some protection against sexually transmitted diseases, including HPV. Some research also
suggests that the hormones in OCs might help the virus enter the genetic material of cervical
cells.
HAVING MANY CHILDREN: Studies indicate that having many children increases the risk for
developing cervical cancer, particularly in women infected with HPV.
SMOKING: Smoking is associated with a higher risk for precancerous changes (dysplasia) in the
cervix and for progression to invasive cervical cancer, especially for women infected with HPV.
IMMUNOSUPPRESSION: Women with weak immune systems (such as those with HIV / AIDS)
are more susceptible to acquiring HPV. Immunocompromised patients are also at higher risk
for having cervical precancer develop rapidly into invasive cancer.
DIETHYLSTILBESTROL (DES): From 1938 - 1971, diethylstilbestrol (DES), an estrogen-related
drug, was widely prescribed to pregnant women to help prevent miscarriages. The daughters
of these women face a higher risk for cervical cancer. DES is no longer prescribed.
Methodology used
First, the data is checked for NaN values. Any NaN found is filled with the column median
if the variable is continuous, or with 1 if it is categorical. With the missing values
filled, the data is ready for defining features and labels, splitting into train and test
sets, and normalisation. The data is shuffled randomly before being split into the train set
and the test set.
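The filling rule described above can be sketched roughly as follows. This is only an illustration, not the code used in the project: the helper name fill_missing, the choice of which columns count as continuous or categorical, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fill_missing(df, continuous_cols, categorical_cols):
    """Return a copy of df with NaNs filled: median for continuous
    columns, the constant 1 for categorical columns (hypothetical helper)."""
    out = df.copy()
    for col in continuous_cols:
        # Coerce to numeric so non-numeric placeholders also become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(1)
    return out


# Made-up toy rows, only to show the behaviour; not real patient data.
toy = pd.DataFrame({
    "Age": [20, 30, np.nan, 40],
    "STDs:HIV": [0, np.nan, 1, 0],
})
filled = fill_missing(toy, continuous_cols=["Age"], categorical_cols=["STDs:HIV"])
print(filled)
```

In the toy frame the missing Age becomes the median of the observed ages, and the missing STDs:HIV flag becomes 1.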
The following columns of the data were used as features for the train and test sets:
1. Age
2. Number of sexual Partners
3. First Sexual Intercourse
4. Num of pregnancies
5. Smokes (years)
6. Smokes (packs/year)
7. Hormonal Contraceptives (years)
8. IUD (years)
9. STDs (number)
10. STDs:condylomatosis
11. STDs:cervical condylomatosis
12. STDs:vaginal condylomatosis
13. STDs:vulvo-perineal condylomatosis
14. STDs:syphilis
15. STDs:pelvic inflammatory disease
16. STDs:genital herpes
17. STDs:molluscum contagiosum
18. STDs:AIDS
19. STDs:HIV
20. STDs:Hepatitis B
21. STDs:HPV
22. STDs: Number of diagnosis
23. STDs: Time since first diagnosis
24. STDs: Time since last diagnosis
25. Dx:Cancer
26. Dx:CIN
27. Dx:HPV
28. Hinselmann: target variable
The label is the ‘Biopsy’ column, for both the train and test sets.
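The shuffling, train/test split and normalisation steps from the Methodology could look roughly like the sketch below. The report only says that the preprocessing package from sklearn was used, so the choice of StandardScaler, the 20% test fraction, the seed and the helper name split_and_scale are assumptions for illustration. The random arrays stand in for the 28 feature columns and the ‘Biopsy’ label.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, seed=0):
    """Shuffle and split, then scale features (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=seed
    )
    # Fit the scaler on the train set only, then apply it to both sets,
    # so that no information from the test set leaks into training.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in data: 100 rows, 28 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 28))
y_demo = rng.integers(0, 2, size=100)
X_tr, X_te, y_tr, y_te = split_and_scale(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Fitting the scaler on the train set alone is a design choice of this sketch; the report does not say how the scaling was fitted.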
Next, the data is normalised using the preprocessing package from sklearn. The model is then
run on this data. I followed a simple, quick and effective MLP approach to solve the binary
classification problem. The following is the screenshot of the result of the training.
I used the Sequential model from keras along with the Dense and Dropout layers.
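A minimal sketch of such a model is given below. The report does not state the number of layers, the layer widths, the dropout rate, the optimizer or the number of epochs, so every one of those values here is an assumption, as is the helper name build_mlp. The random data only shows that the model trains and outputs probabilities.

```python
import numpy as np
from tensorflow import keras


def build_mlp(n_features, hidden_units=64, dropout_rate=0.5):
    """Small MLP for binary classification (all sizes are assumed)."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dropout(dropout_rate),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dropout(dropout_rate),
        # One sigmoid unit gives the predicted probability of Biopsy = 1.
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


# Synthetic stand-in for the 28 normalised feature columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 28)).astype("float32")
y_demo = rng.integers(0, 2, size=32).astype("float32")

model = build_mlp(n_features=28)
history = model.fit(X_demo, y_demo, epochs=2, batch_size=8, verbose=0)
probs = model.predict(X_demo, verbose=0)
print(probs.shape)
```

The history object returned by fit holds the per-epoch loss and accuracy, which is the kind of data a training-history plot is drawn from.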
The following bar graphs show the training history:
The model trained very quickly because the data set was small.
The confusion matrix of the test data set is as shown below:
Sensitivity, specificity, false_positive_rate and false_negative_rate were then calculated and the
results were as follows:
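For reference, the four rates follow directly from the confusion matrix counts. The sketch below uses the standard definitions; the helper name binary_rates and the counts passed to it are made up for illustration and are not the results of this project.

```python
def binary_rates(tp, fp, tn, fn):
    """Compute the four rates from confusion matrix counts.

    tp/fn: biopsy-positive cases predicted positive/negative.
    tn/fp: biopsy-negative cases predicted negative/positive.
    """
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # 1 - sensitivity
    }


# Hypothetical counts, chosen only to demonstrate the arithmetic.
rates = binary_rates(tp=8, fp=2, tn=88, fn=2)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Because the false positive rate is one minus the specificity, a model with high specificity necessarily has a low false positive rate.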
Age has a good positive correlation with biopsy, with most of the patients requiring biopsy
falling in the age group 20-35.
Against our intuition, the number of sexual partners has a low correlation with biopsy.
From the above graph it can be seen that Cytology has a good correlation with biopsy.
From this graph it is evident that Schiller has a really high correlation with biopsy.
Results
In order to find what affects the biopsy result the most, the correlations among all the
elements were derived and shown as a heatmap.
From the image it can be seen that Schiller_1 affects the biopsy the most. The following
heatmap gives a better insight into it.
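The ranking behind such a heatmap could be computed as in the sketch below, which uses the Pearson correlation from pandas. The helper name correlation_with_target and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd


def correlation_with_target(df, target="Biopsy"):
    """Rank numeric columns by their correlation with the target column."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up toy rows; Schiller is set equal to Biopsy to show a perfect match.
toy = pd.DataFrame({
    "Schiller": [1, 1, 0, 0, 0, 1],
    "Age": [25, 31, 40, 22, 35, 28],
    "Biopsy": [1, 1, 0, 0, 0, 1],
})
ranking = correlation_with_target(toy)
print(ranking)
```

The full correlation matrix returned by corr() is what a plotting library would draw as the heatmap; the function above keeps only the column for the target.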
Conclusion
It seems that 'Schiller_1', 'Hinselmann_1' and 'cytology_1' had the highest correlation with biopsy (+).
The result matches common medical knowledge: a diagnostic tool with high specificity would
have a low false positive rate. The correlation in these values gave the model an insight into
the data, and when it predicted on the test values the accuracy was pretty high.