Lung Cancer Detection Using Machine Learning


LUNG CANCER PREDICTION THROUGH REGRESSION TECHNIQUES

ABSTRACT

Changes in food habits and working culture are causing many health issues, among which cancer
is one of the most serious life-threatening diseases. Lung cancer is among the most common
cancers worldwide. Early detection and diagnosis help patients survive longer, whereas failure
to detect the disease early may be fatal. Data mining techniques such as classification,
regression and clustering are powerful tools for disease detection. In this work, we propose
regression analysis on the lung cancer dataset from the UCI repository. The regression
algorithms used in this study are Lasso, Linear, Logistic and Random Forest regression. In our
experiments, random forest regression achieved good accuracy.

INTRODUCTION

Cancer is one of the main causes of non-accidental death. Many surveys have shown that lung
cancer is the leading cause of cancer death worldwide. The death rate can be reduced if people
seek early diagnosis, so that clinicians can administer suitable treatment in time. In short,
cancer is an uncontrolled, abnormal growth of cells that invades the surrounding tissues.
Lung cancer is divided into two main types: non-small cell lung cancer (NSCLC) and small cell
lung cancer (SCLC). In this paper, the work is based primarily on NSCLC patients, since it is
more complicated and difficult to treat. There are many differences in the detection and
treatment of SCLC and NSCLC. There are various methods for detecting lung cancer; one of them
is to use its datasets together with SVM and LR algorithms to build and improve the
classification and prediction model. Data mining techniques are often used to discover
knowledge from large datasets. Data mining has found an important place in every field,
including health care, and has played a major role in extracting hidden information from
medical databases. The mining process goes beyond data analysis and includes classification,
clustering, association rule mining and prediction. If the cancer has spread, a person may
also experience symptoms in other parts of the body. These symptoms are used to calculate the
risk level of the disease. Many symptoms can be recognized at an early stage: chest pain, a
cough that may be chronic, dry or with phlegm, frequent respiratory infections, shortness of
breath, fatigue or loss of appetite, chest discomfort and hoarseness. The primary focus of
this study is to predict the

OVERVIEW

With the traditional techniques adopted by physicians and radiologists worldwide, most lung
cancer types can be detected only at an advanced stage, after the cancer has spread to a
considerable extent. When lung cancer is detected at that stage, the patient's chance of
survival is very low, even with the most sophisticated treatment. Apart from this problem,
misdiagnosis is another major cause of concern. Sometimes doctors may identify a benign case
as malignant, and vice versa. This also puts patients' lives at very high risk.

Smoking cessation, diet modification, and chemoprevention are primary prevention activities.
Screening is a form of secondary prevention. Our method of finding the possible Lung cancer
patients is based on the systematic study of symptoms and risk factors. Non-clinical symptoms
and risk factors are some of the generic indicators of the cancer diseases. Environmental factors
have an important role in human cancer. Many carcinogens are present in the air we breathe, the
food we eat, and the water we drink. The constant and sometimes unavoidable exposure to
environmental carcinogens complicates the investigation of cancer causes in human beings. The
complexity of human cancer causes is especially challenging for cancers with long latency,
which are associated with exposure to ubiquitous environmental carcinogens.

SCOPE OF STUDY

Lung cancer is one of the leading causes of cancer deaths in both women and men. In most
cases, lung cancer manifests itself through early symptoms. Treatment and prognosis depend on
the histological type of cancer, the stage (degree of spread), and the patient's performance
status. Possible treatments include surgery, chemotherapy, and radiotherapy. Survival depends
on stage, overall health, and other factors, but overall only 14% of people diagnosed with
lung cancer survive five years after the diagnosis.

LITERATURE REVIEW

In [1], the authors studied detection through imaging techniques. They trained a
convolutional neural network (CNN) on the Kaggle Data Science Bowl 2017 data. They also used
SVM to classify samples as Model A or Model B. From the CT scans, they segmented the
cancerous portion of the lung, extracted 15 features as text values and used them as input to
the SVM algorithm. They found that this method achieves good classification results.

The authors of [2] worked on early-stage detection through a strategy that uses Ant Lion
Optimizer (ALO) plate detection from images. They applied an optimization technique to the
input features before passing them to the machine learning stage. They compared Genetic
algorithm, wolf optimization and Ant Lion optimization, and concluded that the ALO method
performs best in terms of detection accuracy.

The author of [3] proposed lung disease prediction through rule-based and classification
techniques. For data preprocessing, the author applied the One Dependency Augmented Naïve
Bayes classifier (ODANB) and the Naive Credal Classifier 2 (NCC2), with the aim of exploiting
hidden patterns for classification. The author experimented with ODANB and reported an
accuracy above 70%, and also showed that it can determine the early stage of cancer from
symptoms and patient details.

In [4], the author discussed logistic regression and SVM for lung cancer prediction. The
experiments examined various evaluation metrics, including true positives, true negatives,
false positives and false negatives. The experimental results showed that logistic regression
achieves higher accuracy than the SVM algorithm. This work motivates us to study regression
rather than classification and clustering.

In [5], the authors discussed license plate detection for Chinese plates, using a
back-propagation neural network (BPNN) for character recognition. They designed the detection
specifically for Chinese plates, with specific length and width parameters. For identifying
characters they used a BPNN trained for 50 epochs. Although their model performs well, it is
applicable only to Chinese number plates.

Numan et al. [6] discussed lung cancer prediction through the artificial neural network (ANN)
algorithm, in which they exploited macro neural networks: simple neural networks connected
with no hidden layers to avoid the overfitting problem. Their model showed significant
performance in lung cancer prediction accuracy.

In [7], the author studied lung disease detection using SEER data. The author used five
ensemble classification models and evaluated them with ROC curves. The prediction accuracy
achieved was as high as 90% on the SEER data. The dataset used in this study was a lung image
dataset, on which they performed extensive pre-processing.

The author of [8] discussed prediction from protein attributes, treating small cell lung
cancer versus non-small cell lung cancer as a binary class problem on 1497 protein
attributes. They applied a feature selection technique that retained only the 12 most
important attributes. They used SVM and ANN for disease prediction; SVM performed better in
terms of lung cancer detection accuracy.

In [9], the author discussed lung cancer prediction through classification techniques. The
author used the Naive Bayes and J48 algorithms in the Weka tool to classify samples into
three classes, namely A, B and C. In their experimental results, the J48 model outperformed
the other models. However, their technique is not very effective, as all the models used are
classification models.

In [10], the author discussed lung cancer detection through image processing techniques. The
author applied a Gabor filter and Gaussian rules for preprocessing. The proposed technique
was efficient at segmenting the tumor area. However, although image processing can be used
for tumor detection, the image quality must be high for detection to work.

PROBLEM DEFINITION

Artificial intelligence, machine learning and deep learning are emerging technologies that
are widely used in medical diagnosis. Living standards among people in urban cities are
rising, and with them health issues and life-threatening diseases such as cancer are also
increasing. Artificial intelligence relies on a set of machine learning and deep learning
algorithms, among which regression models are widely used and bring efficiency to learning
and detection. Lung cancer is of two types: small cell and non-small cell. Automated
regression analysis that labels lung disease as affected or normal (a binary class problem)
is very important in computer-aided diagnosis and for accurate, early detection.

EXISTING SYSTEM

Existing techniques use imaging. The authors trained a convolutional neural network (CNN) on
the Kaggle Data Science Bowl 2017 data and also used SVM to classify samples as Model A or
Model B.

Ant Lion Optimizer (ALO) plate detection is used to detect lung cancer from images. The
authors applied an optimization technique to the input features before passing them to the
machine learning stage. They compared Genetic algorithm, wolf optimization and Ant Lion
optimization, and concluded that the ALO method performs best in terms of detection accuracy.

Classification techniques called One Dependency Augmented Naïve Bayes classifier (ODANB) and
Naive Credal Classifier 2 (NCC2) are used, with the aim of exploiting hidden patterns for
classification.

Logistic regression and SVM are used for lung cancer prediction.

Lung cancer is predicted through the artificial neural network (ANN) algorithm, exploiting
macro neural networks: simple neural networks connected with no hidden layers to avoid the
overfitting problem.

DRAWBACKS

• Most studies were carried out on image datasets.

• Classification is studied as the major technique for detection.

PROPOSED SYSTEM

• The proposed system studies lung cancer detection using various regression algorithms.

• We collected the dataset for analysis from the UCI repository and applied machine learning
algorithms such as regression for lung disease prediction.

• There are three labels named 1, 2 and 3, which represent the severity of the disease in
ascending order. This is considered to be a multi-class problem.

• Logistic, Linear, Lasso and Random Forest regression are used to predict the lung disease.

ADVANTAGES

• High accuracy of detection

• Multi-class problem is solved



HARDWARE REQUIREMENTS

• Processor : Any processor above 3 GHz
• RAM : 8 GB DDR3
• Hard Disk : 128 GB SSD
• Graphics Card : 2 GB
• Input device : Standard keyboard and mouse
• Output device : VGA and high-resolution monitor

SOFTWARE SPECIFICATION

• Operating System : Windows 8.1 or higher
• Programming : Python 3.7 and related libraries

Architecture Diagram

[Diagram: Dataset → Training set → Regression algorithm → Learn model → Model;
Test set → Test model → Predicted results]

Figure 1: System architecture

The figure above shows the architecture of the proposed system, with all modules of the work
represented: the user supplies the input dataset, a model is trained, and predictions are
produced.

MODULES

The modules included in our implementation are as follows:

• Dataset collection

• Data pre-processing

• Training and prediction using regression models

DATASET COLLECTION

The UCI repository data for lung cancer is used for this study. It contains 32 samples with
56 attributes and a label. There are three labels named 1, 2 and 3, which represent the
severity of the disease in ascending order. This is considered to be a multi-class problem.

The dataset variables are described below.

Variable name     Attribute description
Class             Class value labelled as 1, 2 and 3
56 Attributes     Total of 56 predictive attributes are considered



DATA PREPROCESSING

The dataset is split into a training set and a test set. For training, X_train holds the 56
predictive attributes from the dataset and Y_train holds the label. The following phases of
work were considered in the proposed model.

Figure : Visualization of Dataset as Histogram
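
As a rough illustration of the train/test split described above, the sketch below uses a
synthetic stand-in with the same shape as the dataset (32 samples, 56 attributes, labels 1 to
3). The random values and the variable names x and y are assumptions for illustration only;
they are not the UCI data or the authors' preprocessing code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (NOT the UCI data): 32 samples, 56 integer-coded
# attributes, and class labels 1, 2 and 3.
rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=(32, 56))
y = np.tile([1, 2, 3], 11)[:32]

# Hold out 25% of the samples for testing, as in the code later in the report.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)  # (24, 56) (8, 56)
```

With only 32 samples, the test set holds just 8 cases, so any accuracy measured on it changes
in steps of 12.5 percentage points.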



TRAINING AND PREDICTION USING REGRESSION MODELS

Logistic regression

Logistic regression is a predictive analysis technique. It is used to describe data and to
explain the relationship between one dependent binary variable and one or more nominal,
ordinal, interval or ratio-level independent variables.

Flow chart of the logistic regression algorithm:

Start
→ Input: training data and testing data
→ Compute the regression coefficients of the training data
→ Sigmoid function
→ Find the relationship between the training data and the testing data
→ Output: the object's positions
→ End
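
The sigmoid step in the flow above can be sketched as follows. This is a minimal
illustration, not the authors' code: the function names, the coefficients and the sample
values are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(w * v for w, v in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical coefficients and one sample, for illustration only.
p = predict_probability([1.0, 2.0], coefficients=[0.5, -0.25], intercept=0.0)
print(p)  # z = 0.5 - 0.5 = 0, so p = 0.5
```

A sample is then assigned to the positive class when the probability exceeds a chosen
threshold, commonly 0.5.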

The following code is used for training and prediction through logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# x holds the 56 predictive attributes and y holds the class label
# split the dataset into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(x_train, y_train)

# predict the test set results
y_pred = model.predict(x_test)

Linear regression

In simple regression, we predict scores on one variable from the scores on a second variable.
The variable being predicted is called the target (criterion) variable and is denoted Y. The
variable on which the predictions are based is called the predictor variable and is denoted
X. When there is only one predictor variable, the prediction method is called simple linear
regression. In this model, the predictions of Y, plotted as a function of X, form a straight
line, y = b0 + b1·x, where b0 is the intercept and b1 is the slope.

Exploring ‘b1’

If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an
increase in x will increase y.

If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an
increase in x will decrease y.

Exploring ‘b0’

If the data do not include x = 0, then the prediction at x = 0, which consists of b0 alone,
is meaningless. For example, consider a dataset that relates height (x) and weight (y).
Taking x = 0 (that is, a height of 0) leaves only the b0 term in the equation, which is
completely meaningless because in reality height and weight can never be zero. This results
from applying the model to values beyond its scope.

If the data include the value 0, then ‘b0’ is the average of all predicted values when x = 0.
However, setting all the predictor variables to zero is often impossible.

The b0 term guarantees that the residuals have mean zero. If there is no ‘b0’ term, the
regression line is forced to pass through the origin, and both the regression coefficients
and the predictions will be biased.
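
The roles of b0 and b1 can be checked numerically with a small least-squares fit. The sketch
below is only an illustration: the height and weight values are invented, and the helper
name fit_simple_linear is hypothetical.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Invented height (cm) and weight (kg) values, for illustration only.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([52.0, 58.0, 67.0, 74.0, 84.0])

b0, b1 = fit_simple_linear(height, weight)
residuals = weight - (b0 + b1 * height)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = -69.00, b1 = 0.80
print(f"mean residual = {residuals.mean():.2e}")
```

Here b1 is positive, so weight increases with height, while b0 is a negative "weight" at
height 0, which shows why the intercept has no physical meaning outside the range of the
data. The mean residual is zero up to floating-point rounding because the intercept is
included.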

The following code is used for training and prediction through linear regression.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# split the dataset into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(x_train, y_train)

# predict the test set results
y_pred = model.predict(x_test)

Lasso Regression

Lasso is a powerful regression technique. It works by penalizing the magnitude of the feature
coefficients while minimizing the error between predicted and actual observations. Lasso is
known as an L1 regularization technique. It attempts to minimize the cost function

Cost(W) = RSS(W) + α × (sum of absolute values of the weights)

Here RSS refers to the ‘Residual Sum of Squares’, that is, the sum of squared errors between
the predicted and actual values in the training data set, and α is a coefficient that can
take various values. There are three cases for the value of α.

1. α = 0: same coefficients as simple linear regression

2. α = ∞: all coefficients are zero

3. 0 < α < ∞: coefficients between 0 and those of simple linear regression
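
The L1-penalized cost can be written out directly. The sketch below is a minimal
illustration with a hypothetical helper name (lasso_cost) and toy data; it omits the
intercept term for brevity.

```python
import numpy as np

def lasso_cost(weights, X, y, alpha):
    """RSS plus alpha times the sum of absolute weights (L1 penalty)."""
    weights = np.asarray(weights, dtype=float)
    residuals = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ weights
    rss = np.sum(residuals ** 2)
    l1_penalty = np.sum(np.abs(weights))
    return float(rss + alpha * l1_penalty)

# Toy data for illustration: these weights fit y exactly, so RSS is 0.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])

print(lasso_cost(w, X, y, alpha=0.0))  # 0.0, no penalty
print(lasso_cost(w, X, y, alpha=0.5))  # 1.5 = 0 + 0.5 * (|1| + |2|)
```

Note that scikit-learn's Lasso scales the RSS term by 1 / (2 × number of samples), so its
alpha values are not directly comparable with the α in the unscaled formula above.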

The following code is used for training and prediction through Lasso regression.

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# split the dataset into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

model = linear_model.Lasso()
model.fit(x_train, y_train)

# predict the test set results
y_pred = model.predict(x_test)

Random Forest Model

1. Suppose there are n cases in the training dataset. From these n cases, sub-samples are
drawn at random with replacement. These random sub-samples are used to build the individual
trees.

2. Assuming there are k input variables, a number m is chosen such that m < k. At each node,
m variables are selected at random out of the k variables, and the best split on these m
variables is used to split the node. The value of m is kept unchanged while the forest is
grown.

3. Each tree is grown as large as possible, without pruning.

4. The class of a new object is predicted by the majority of the votes received from all the
decision trees. For regression, the predictions of the trees are averaged instead.
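
The four steps above can be sketched by hand on top of scikit-learn's DecisionTreeClassifier.
This is a simplified illustration under stated assumptions, not the implementation used in
the report: the helper names fit_forest and predict_forest, the default m = sqrt(k), and the
synthetic data are all hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=15, m=None, seed=0):
    """Grow n_trees unpruned trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    m = m or max(1, int(np.sqrt(k)))  # assumed default; m must be < k
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # step 1: sample n cases with replacement
        tree = DecisionTreeClassifier(
            max_features=m,  # step 2: consider m random variables per split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )  # step 3: no depth limit, so trees are grown fully
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Step 4: majority vote over the predictions of all trees."""
    votes = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

# Synthetic data for illustration: 30 samples, 6 variables, labels 1-3.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
y = np.repeat([1, 2, 3], 10)
X[:, 0] += 5 * y  # make the first variable informative

trees = fit_forest(X, y)
pred = predict_forest(trees, X)
print(pred[:5])
```

In practice, RandomForestClassifier and RandomForestRegressor in scikit-learn bundle these
steps; the regressor averages the trees' outputs instead of voting.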

Figure: Flow chart of Random Forest

The following code is used for training and prediction through random forest regression.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# split the dataset into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

model = RandomForestRegressor()
model.fit(x_train, y_train)

# predict the test set results
y_pred = model.predict(x_test)
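
The report does not show how accuracy is computed from the regression outputs. One possible
approach, sketched below purely as an assumption, is to round each continuous prediction to
the nearest label (1, 2 or 3) and compare it with the true label. The data here are a
synthetic stand-in, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (NOT the UCI data): 32 samples, 56 attributes, labels 1-3.
rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=(32, 56)).astype(float)
y = np.tile([1, 2, 3], 11)[:32]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Linear": LinearRegression(),
    "Lasso": Lasso(),
    "Random forest": RandomForestRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    raw = model.predict(x_test)
    # Assumption: round continuous outputs to the nearest label, clipped to 1..3.
    labels = np.clip(np.rint(raw), 1, 3).astype(int)
    scores[name] = accuracy_score(y_test, labels)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Rounding treats the labels as an ordered scale, which matches the description of labels 1 to
3 as increasing severity; a different mapping would be needed if the classes were unordered.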
