1CD22MC043 Part 2
INTRODUCTION
Lung cancer is one of the cancer types that begins with the abnormal development of cells in the lungs. It is mainly seen among adults nowadays, whose lifestyle and health habits also affect the lungs. The cancer starts when cells in the lungs acquire mutations or begin to grow uncontrollably; the normal cells change their behaviour because of the mutated cells, and the resulting growth progressively damages the lungs and turns into lung cancer.
The main cause of this type of cancer is tobacco smoking, which is the prime factor leading to lung cancer and accounts for about 90% of occurrences. While smoking, a person inhales the dangerous carcinogens present in tobacco smoke, which damage the cells that line the lungs. Initially the body repairs the damage, but with repeated exposure it becomes irreversible. Second-hand smoke likewise increases the risk of lung problems for non-smokers.
Lung cancer is, more broadly, a disease of uncontrolled cell growth in lung tissues. It is responsible for more deaths than breast, colon, and prostate cancers combined. Generally, there are two primary kinds of lung cancer, each with different features and corresponding therapies.
NSCLC forms about 85% of all lung cancer cases. It also includes several subtypes of its own, mainly adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. Adenocarcinoma generally originates in the peripheral parts of the lungs and is most common among non-smokers and former smokers. Squamous cell carcinoma most commonly arises in the central airways of the lungs and is caused by smoking. Large cell carcinoma is a less common histological subtype that can be located in any part of the lung; this type also tends to grow and spread quickly.
SCLC comprises only about 15% of all lung cancers. It is a rapidly growing cancer and, unlike NSCLC, often presents early with disease already disseminated to parts of the body beyond the lungs. Its highly aggressive course governs many of the treatment choices available, leaving options limited mainly to chemotherapy with concurrent radiation therapy. Surgical approaches are inapplicable in many cases because, by the time SCLC is diagnosed, it has usually disseminated considerably. Distinguishing the types of lung cancer is therefore very important for building treatment tools and improving patient health.
Advances in image processing: Convolutional Neural Networks have essentially revolutionized medical imaging through automated detection and classification of abnormalities related to lung cancer in scans.
• Virtual Trials/Simulation: These simulation models support the assessment of the potency of new drugs or combinations of medications, which is likely to increase the pace of drug discovery and decrease its costs.
• Challenges and Future Directions, Data Quality and Integration: Success of ML depends on the availability of large, diverse, good-quality datasets representative of various patient populations.
• Ethical Considerations: Equitability, transparency, and accountability of algorithms are noteworthy when they are practiced across different healthcare settings.
• Regulatory Approval and Monitoring: Methods for validating AI-based diagnostics and therapies must be explored, along with regulatory approval pathways that give assurance of patient safety and efficacy.
The scans produced are cross-sectional images of the body, which have higher resolution and detail in comparison with earlier X-rays and allow exact detection and characterization of lung nodules and tumours.
Digital Imaging: The shift of imaging systems from the analogue methods in use since the early years to digital methods in the late twentieth century improved the quality, storage, and transmission of medical images. This opened room for advanced computational algorithms for image analysis, which can be integrated into digital systems to deliver automated and quantitative methods for lung cancer detection.
Applications
Uses of Machine Learning in Lung Cancer
Even though lung cancer has long been among the main causes of cancer-related death worldwide, innovative strategies for early detection, accurate diagnosis, and treatment personalization have had to follow. One game-changer in the health sector is ML, with computational algorithms working on a myriad of large, complex datasets and hence optimizing clinical outcomes.
• Early Detection and Diagnosis
Image Analysis: ML algorithms help decode radiographic images, typically X-ray and CT scans, for all the small signs pertaining to pulmonary cancer. The categorization and classification done by convolutional neural network methods result in a very precise diagnosis, thus enabling an early-stage diagnosis.
• Radiomics:
In more detail, features of the tumour that cannot be recognized by the human eye on an image are extracted as quantitative descriptors from medical images using ML algorithms. This has opened up the potential for determining radiomic features driving tumour histology, prognosis, and response to therapy, which enables strategies adapted on an individual basis, as well as prognosis and risk assessment.
• Predictive Modelling:
Likewise, ML models can utilize the entirety of the distributed data coming from the clinic, imaging, and genetics in an attempt to predict patient outcomes or the course of the disease. This enables risk stratification tools to recognize high-risk patients for whom intensive surveillance or advanced intervention is useful.
• Genomic Analysis
Algorithms developed in ML focus on genomic data, identifying biomarkers of susceptibility, prognosis, and response to targeted therapies against lung cancer. Personalized medication approaches investigate the genetic profiles of individuals to tailor their medication regimens, increasing efficacy and decreasing side effects. The application of ML in drug discovery allows much faster predictions of efficacy in finding potential targets than the traditional approaches to optimizing therapeutic combinations. Virtual screening models the interactions of drugs with targets and predicts the pharmacological outcome, thus fast-tracking the development of novel therapies.
MOTIVATION
ML techniques make many more classification approaches available for identifying the disease. Better than that, they improve the prognosis and survival rates of patients diagnosed at a relatively early stage of this type of cancer. Diagnosis today still relies on chest X-rays and CT scans, whose reading is grounded in subjective interpretation by radiologists. As a result, diagnosis can come late, with the disease not detected in its early stages. An automated system able to review vast amounts of medical imaging data in seconds is therefore desirable, picking up lung cancers that might otherwise be missed by the human eye. At its core, it is the pattern-recognition ability of ML models that lets a suspicious spot or lesion be identified in lung imaging cases.
Diagnostic radiology is complex: medical images carry a large amount of information about the patient, and modalities such as CT and PET scans provide graphical anatomical and functional descriptions in which it is difficult to track patterns of cancerous growth that are not yet active or developed. Readings may sometimes give wrong results, false negatives or false positives; using ML methods reduces the chance of a non-cancerous condition being classed as cancerous. Even so, a model may generate a different result from one run to the next, so its output cannot be taken as completely true and may vary from result to result.
Precision medicine: Algorithms developed for the above use can be applied under the theme of precision medicine, in which the analysis draws on all of a patient's data. What matters most is the data itself, its availability, and its quality: the availability of big, annotated datasets has traditionally been a bottleneck, especially when training highly robust ML models, and further challenges arise in data curation and model generalization. Questions of interpretability and validation follow, since deep learning models are still treated largely as black boxes; a model should lay out the transparency of its predictions, and mechanisms should be established within validation frameworks to build a robust level of clinician trust and achieve regulatory approval. Overcoming such technical, interoperability, regulatory, and training-related challenges is important for successful clinical implementation and adoption of ML-based tools to improve diagnostic accuracy and patient care. Future directions and emerging trends in research on machine learning for lung cancer detection include, as a next step, multimodal fusion.
3. Role of machine learning in meeting the challenges: Machine learning can rescue the ambition to detect lung cancer through the following.
a. Advanced Pattern Recognition:
ML algorithms can process image data in bulk at a quick pace with amazing accuracy; this level of processing helps identify even very subtle patterns and aberrations that indicate the earliest stages of cancer.
b. Diagnostic Sensitivity: Machine learning approaches applied to medical imaging data can reduce false positives and false negatives, thereby improving diagnostic sensitivity.
c. Clinical Decision Support: Since the diagnostic tool is itself a machine learning model, it can easily be run as decision-aid support for healthcare providers, giving insights and recommendations from analysis of the patient images at hand.
4. Research Gap and Innovation: Despite the colossal strides made up to this point in medical imaging technology, there is still a wide research gap in coming up with robust, ML-driven solutions for the identification of lung cancer. Validation Studies: Many of the ML health models in existence today are still in development. To this day they need an enormously stringent and vital validation process on large, heterogeneous datasets to provide assurance that the models are robust and generalizable.
Translation of Machine Learning Diagnostics to Clinical Practice: This requires not only technical considerations but also regulatory and ethical ones when translating ML-based identification techniques into general clinical practice. The work should also include transparent, open algorithms and education of clinicians on the issues of applying ML to health, coupled with data protection. Ethical issues: The major ethical concerns of applying ML in healthcare at this moment relate to privacy, informed consent, and fairness in access to health care. For algorithm development and implementation, it is now paramount that responsible and ethical assurance be retained in information-sensitive areas.
CHAPTER 2
LITERATURE SURVEY
Early detection can make a very big difference to patient outcomes. This survey therefore gives broad visibility into the current potential of machine learning to improve lung cancer detection from medical imaging data in terms of accuracy and efficiency. The objective is to review the historical development, approaches, and current state of application of ML in lung cancer detection. Lung cancer is globally considered a prime cause of mortality and morbidity. Suzuki et al., for example, enabled the prediction of malignancy in pulmonary nodules using what they described as radiomics (quantitative features derived from medical image data) together with integrated machine learning classifiers, which achieved better diagnostic accuracy than conventional imaging interpretation. This appeared to enable the visualization of heterogeneous patterns of tumour evolution and tumour-associated features.
Deep learning architectures push radiography into the area of key detection of lung cancer. For example, incorporating 3D CNNs aimed at the spatial and structural patterns required for a proper diagnosis from volumetric CT data seems very promising. Ardila et al. developed a deep learning method that reaches sensitive detection of malignant neoplasms, such as lung nodules, likely to be found on low-dose computed tomography, with performance comparable to that of radiologists. The proven high potential of this new deep learning approach is that it surmounts interindividual variability in human interpretation.
Equally valuable are other studies. For instance, Liang et al. in 2020 investigated the incorporation of transfer learning, which fine-tunes a pre-trained neural network for a new task, in this case the differentiation of lung nodules. Their fine-tuned model, pretrained on large-scale datasets, outperformed current models in identifying nodules across different imaging modalities and patient demographics. This enables models to be designed in short periods and extrapolated across an entire database, which is the most remarkable realization for clinical application.
Applying machine learning to the detection of lung cancer still faces a number of challenges, especially the following.
First of all, there is the data itself, its availability, and its quality. The availability of big, annotated datasets has traditionally been a bottleneck, especially when training highly robust ML models. Next come challenges in data curation and model generalization.
Then come questions of interpretability and validation, since models in the area of deep learning are treated, at least for the time being, like black-box applications. A few questions that need answering are: first, how to lay out the transparency of a model's predictions; and secondly, what mechanisms should be established within validation frameworks so that there is a robust level of trust by clinicians, with the aim of achieving regulatory approval.
Overcoming such technical, interoperability, regulatory, and training-related challenges would
be important for successful clinical implementation and adoption of ML-based tools in the
improvement of diagnostic accuracy and patient care.
Future research directions for the use of machine learning in lung cancer prediction:
A second step includes multimodal fusion, where features from different imaging modalities are combined, integrating the information given by both CT and PET with non-imaging biomarkers, including genetic data, which enriches the diagnostic process and the overall personal treatment plan.
Explainable Artificial Intelligence: At a minimum, explainable versions of the ML models should provide clinical justification of the model's decisions to the people involved. With them, the reasoning or logic behind the diagnosis has to be made vivid.
Real-Time Decision Support Systems: Developing real-time clinical machine learning algorithms will give decision support in diagnosis to radiologists and oncologists.
2.1 Existing and Proposed System
AI Ethical Framework Articulation: It will therefore require a fairly open, fair, and sane process at each step in the improvement of the AI framework, from the very conception of issues like bias and privacy that may be embedded within the model, to the governance framework laid out for how the model is deployed under a compliance regime. This holds for the system as a whole, the capability we are claiming, as well as for an implementation that makes use of only a few general principles:
These methods open the way for artificially generated and synthesized data to augment the collection without the need for further rounds of data collection, thereby improving diversity and representation. This in turn leads to further representativeness, with gains in generalizability and even better model performance.
Improved models continue to be updated with regard to suggestions by clinical experts. The adaptations will therefore increasingly be tested against real-world data.
Participation: Healthcare providers, research institutions, and ethical bodies collaborate on data sharing and validation studies.
2.2 FEASIBILITY STUDY
A feasibility study for lung cancer detection with machine learning is a process aimed, in practice, at finding out whether such a system could be built and feasibly implemented within a clinical setup. It examines the various factors associated with implementing the approach: technical, economic, and operational aspects, together with legal and ethical ones. All the above-mentioned factors are considered while estimating the practical challenges and benefits of using machine learning-based models for early lung cancer detection and diagnosis.
Computational Resources: Machine learning deployed for lung cancer detection can be resource-intensive at the best of times. It requires either high-performance computing clusters or GPU acceleration, considering that it has to work with large amounts of data and train models to optimum efficiency. Economic feasibility therefore also depends on the availability of resources and the scaling of the algorithms.
Imaging data acquisition and storage costs: A huge amount of imaging data is normally acquired and stored, and this brings costs, especially in scenarios where datasets are sourced from several healthcare institutions or require extensive annotation by medical personnel. This raises further issues with the economics of processing data for HIPAA compliance across the various media available for storing the information.
Development and Implementation Costs: The cost of developing the full software providing this functionality, along with tuning the corresponding algorithms, setting up the hardware infrastructure, and integrating it into existing clinical systems, shall be balanced against the possible benefits of increased diagnostic accuracy, reduced treatment cost, and better lung cancer outcomes.
ROI: A machine learning system for early lung cancer prediction brings benefit derived from money saved through early diagnosis and planning of customized management; it reduces healthcare expenditure through rises in efficiency, or even creates revenue from commercialization of the developed technologies, such as diagnostic software.
Scalability and sustainability: The system shall support the volume and complexity of data that will most probably grow in the near future. Upgrading and maintaining the machine learning models with new data as much as possible will, in the long run, keep the application of AI technology to the observation of lung cancer sustainable and effective.
Introduction
In the right hands, machine learning can do wonders for the health sector in the observation and diagnosis of dangerous, fatal diseases such as cancer. This chapter defines the tools, technologies, algorithms, and software used for lung cancer identification with machine learning techniques. The study focuses on the most popular algorithms: Logistic Regression, K-NN, RF, DT, SVM, and Naive Bayes. As a software environment, it makes use of the Python programming language implemented in Visual Studio Code.
Flexibility, usability, and powerful libraries and frameworks make Python the principal language for development in both data science and machine learning. In machine-learning applications, the richness of its ecosystem is likely among the principal reasons its popularity keeps increasing. It features rich libraries, including NumPy, Pandas, Matplotlib, and Scikit-learn, which together form an essential package for data manipulation, visualization, and implementation of machine learning algorithms. Easy to Learn/Use: The clarity of the syntax makes it rather easy to learn for beginners as well as veteran developers.
Community Support: It has a big, active community that shares libraries and knowledge and offers support through forums and online communities.
Extensibility: VS Code has great community and developer support, with plenty of extensions that extend its functionality for tasks like Python development, debugging, version control and Git integration, and working with Jupyter Notebooks.
The majority find it user-friendly: syntax highlighting and code completion sit within a user-friendly interface, and the internal terminal and debugging instruments help in building and testing the machine learning models.
3. The scikit-learn Library
Scikit-learn, one of the cornerstone packages of machine learning in Python, covers everything from pre-processing the data through the choice of method to estimation and deployment. Here are a few basic characteristics:
ML Algorithm Implementation: Scikit-learn provides machine learning algorithms such as LR, KNN, RF, DT, and SVM for both supervised and unsupervised learning strategies.
Uniform API: The module proposes a uniform API across different algorithms. This goes a long way towards making it very easy to test several algorithms and compare and contrast their performance over varied learning problems.
Model Evaluation: Scikit-learn has utilities to evaluate models against metrics such as accuracy, precision, recall, F1-score, the ROC-AUC curve, the confusion matrix, and so on. These help developers ascertain the quality of lung cancer detection models. A short sketch of this evaluate-and-compare workflow follows.
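A minimal sketch of how the uniform API and evaluation utilities fit together, assuming a feature matrix X and binary labels y already prepared from the survey data (these names are illustrative, not the project's exact variables):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# split the prepared features and labels (X and y are assumed to exist)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# every scikit-learn estimator exposes the same fit/predict interface
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
pred = model.predict(x_test)

# standard evaluation utilities
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))
print(confusion_matrix(y_test, pred))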
One major thing about the coefficients of logistic regression is that they are interpretable: each coefficient gives a feature's relative importance towards the prediction. This can help in learning the latent factors contributing to detection. Finally, the real benefit of logistic regression here is that it is simple and very interpretable, although it will not manage to model the complex interactions of features that occur in high-dimensional data, a typical case being medical images.
Scalability: It can be trained efficiently on large datasets and parallelized, so there is huge potential applicability to lung cancer detection over massive collections of medical records and image data.
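As a hedged illustration of this interpretability, the learned weights can be paired with the column names (this assumes a fitted LogisticRegression called model and a DataFrame x_train, as in the sketch above; the variable names are illustrative):

import pandas as pd

# pair each feature with its learned weight and rank them (illustrative, assumes a fitted model)
coef_table = pd.DataFrame({
    "feature": x_train.columns,
    "coefficient": model.coef_[0]
}).sort_values("coefficient", ascending=False)
print(coef_table)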
Decision Tree:
A tree-form model of decisions and their possible consequences. Decision trees decompose a dataset into smaller subsets while building up a connected decision tree. A very important point is that the relatively easy-to-explain, easy-to-visualize decision tree lays out the hierarchical decision process throughout.
Managing Non-linear Relationships: It can capture the drastic non-linear relationships between image features and outcomes, which is helpful for identifying trends and patterns that are otherwise cumbersome to find in lung cancer data.
Pruning: Paths are pruned to simplify the decision tree so that it does not overfit and, in general, gives better performance on unseen data; alternatively, trees are combined in ensemble methods like Random Forest. A brief sketch follows.
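A minimal sketch, assuming the x_train/x_test split used elsewhere in this report, of how pruning is typically controlled in scikit-learn and how trees are combined into a Random Forest (the max_depth, ccp_alpha, and n_estimators values are illustrative):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# limit depth and apply cost-complexity pruning to curb overfitting
pruned_tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, ccp_alpha=0.01, random_state=0)
pruned_tree.fit(x_train, y_train)
print("Pruned tree accuracy:", pruned_tree.score(x_test, y_test))

# or combine many trees in a Random Forest ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print("Random Forest accuracy:", forest.score(x_test, y_test))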
4.5 SVM
SVM is a supervised machine learning methodology for classification or regression. It finds the optimal hyperplane that separates data belonging to different classes. Kernel methods are especially useful when working with medical imaging data, since they transform the data into high-dimensional spaces where the complex patterns involved become separable, which elevates the results of the classification process.
Margin Maximization: It maximizes the margin between different classes and therefore generalizes the model to unseen data and reduces the chance of memorizing the data, that is, overfitting.
High-Dimensional Spaces: Computation on high-dimensional datasets is efficient. A short sketch follows.
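A minimal sketch of an RBF-kernel SVM on the same split (the C and gamma values here are assumptions, not tuned results from this project):

from sklearn.svm import SVC

# the RBF kernel implicitly maps features into a higher-dimensional space where a separating hyperplane is sought
svm_model = SVC(kernel="rbf", C=10, gamma="scale", probability=True, random_state=0)
svm_model.fit(x_train, y_train)
print("SVM accuracy:", svm_model.score(x_test, y_test))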
Hardware requirements
Software requirements
• OS used : Windows 11
• Languages : Python
• Database : SQLite database
• Editor : VS Code
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION (SRS)
This document outlines the requirements for building a system in which machine learning algorithms are run over the lung cancer test reports uploaded by users to produce a bounded outcome, either detection or non-detection of cancer, using algorithms such as KNN, SVM, Logistic Regression, and Random Forest. It also supports interpretation of the results by both users and administrators.
This work deals with the development of a web-based application through which patients upload test reports regarding lung cancer. The uploaded data then undergoes pre-processing in the system, after which the system feeds the data to the machine learning algorithm to output a prediction. The precise mission is to take the uncertainty out of cancer detection using the Random Forest algorithm. The collection of prediction results is meant to be scalable, explained, and presented to users, patients or admins, through a comfortable interface.
3.1 Functional Requirements
Use Case 1:
Description: The user will be able to upload the data from his or her lung cancer test report.
Actors: User
Preconditions: The user shall have the lung cancer test report file ready.
Postconditions: The report data is saved into the system for processing.
Main Flow:
The user fills in the report data.
After submission, the system analyses the data and detects the result.
If the prediction is cancer-free, it displays the no-cancer interface.
If the prediction is positive, it displays "Cancer detected", along with precautions.
The system saves the report data in the database.
Use Case 2: Process Data and Apply Algorithms
Description: The system processes the uploaded data and subsequently applies machine learning algorithms to it.
Actors: System
Preconditions: A lung cancer test report is already available in the machine learning system.
Postconditions: The algorithms produce a prediction from the input provided by the user.
Main Flow: Get the stored data of the lung cancer test report. The data may be normalized and feature-selected (a preprocessing sketch follows this use case).
Exceptions: The system gives the message "no result found" if the prediction does not produce a result.
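A minimal preprocessing sketch for this step, assuming the tabular survey data and train/test split used later in this report (the scaler and selector choices are illustrative, not the project's fixed pipeline):

from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# normalize feature ranges, then keep the k most informative features
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_train)

selector = SelectKBest(score_func=chi2, k=10)
x_selected = selector.fit_transform(x_scaled, y_train)
print("Selected feature indices:", selector.get_support(indices=True))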
Usability Requirements:
The user interface shall be intuitively designed so that the user can navigate it with guidance.
Security Requirements: Data shall be encrypted both in transit and in storage from the start. Network mechanisms for user authentication and authorization are also in place.
ACCURACY
Our project fundamentally puts human-centred design at the core of developing a robust system for lung cancer prediction by machine learning techniques, implemented in the Python language through the VS Code environment. At the core of our system is the feature that gives users, most probably patients, the capability to upload their lung cancer test reports. Immediately after a report is uploaded, the system rigorously validates the data for integrity and file compatibility. Dataset normalization then follows, with feature selection. This is a vital stage that greatly improves the quality of the eventual prediction.
Several machine learning algorithms were applied: Logistic Regression, K-Nearest Neighbours, Decision Tree, SVM, and Random Forest, the last being key to the functionality of our system. After rigorous evaluation, Random Forest turned out to be the algorithm that provided the best, least biased results for our dataset and was unmistakably ahead of the other models in predicting the existence or non-existence of lung cancer (a comparison sketch follows). This predictive power is very important because it gives the system the capability to provide actionable insights to users and administrators, facilitating informed decisions about probable medical interventions. The approach aims at a technically sound implementation of the algorithm and gives users peace of mind by offering intuitive interaction through a clean interface.
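A hedged sketch of such a comparison, assuming all five classifiers and the train/test split shown in the implementation chapter (the loop is illustrative, not the project's exact evaluation script):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM": SVC(kernel="rbf", C=10, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# train each model on the same split and compare test accuracy
for name, clf in candidates.items():
    clf.fit(x_train, y_train)
    print(name, ":", accuracy_score(y_test, clf.predict(x_test)))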
SECURITY
Security is a top priority in our lung cancer detection project; the protections adopted for sensitive medical data and secure interactions are very tight. During transmission or storage, the data remains unreadable to any unauthorized entity; state-of-the-art encryption protocols protect it. Strong authentication mechanisms prove the identity of a user before he or she can gain entry to the platform, assuring data integrity against unauthorized access. Role-based policies outline user authorization in accordance with their roles and enforce access control. Conformity with regulatory standards like GDPR and HIPAA is adhered to; this focuses on patient privacy and articulates principles of good practice in handling data. Finally, proactive vulnerability-assessment security audits improve our defence against emerging threats and underline the commitment to keeping the environment safe for harnessing machine learning in healthcare.
RELIABILITY
Our lung cancer detection project relies on reliability, high performance, and dependable results being delivered to the user. We will develop a strong architecture able to bear various loads on the system without trading off either accuracy or responsiveness. Integrity of data is assured through robust validation of uploaded reports, together with safe storage and processing. Among the models compared, the Random Forest classifier was found the most accurate for making reliable predictions. System metrics are monitored as a continuous process, fixing problems before they occur, so as to provide uninterrupted availability and hence reliable service delivery to the healthcare professionals and the patient community.
CHAPTER 4
SYSTEM DESIGN
4.1 Context diagram
4.2 Data flow Diagram
CHAPTER 5
DETAILED DESIGN
Detailed design components (from the design diagram): User Report, Preprocessing, User, Training, Classification.
5.2 ACTIVITY DIAGRAM
The input is the report details. The flow starts with "Patient Submits Report Details". It then moves to "Data Validation", which reviews the accuracy and completeness of the data, followed by "Data Integration" to integrate the information with the patient's previous medical records. "Feature Extraction" extracts the relevant features from the data for use during "Machine Learning Model Training". The details of the report are then analysed by the trained model in a subsequent detection step. "Radiologist Reviews Findings" and "Oncologist Analyses Results" then ensure that an accurate diagnosis is facilitated. Finally, the "Generate Report" activity consolidates the findings, making them known to the patient and the medical professionals.
CHAPTER 6
IMPLEMENTATION
The Random Forest algorithm achieved an exciting accuracy of about 89%, which makes it attractive for detecting lung cancer from analysis of patient report data. It generates many decision trees and then merges their results to identify strong predictors for lung cancer.
Such high accuracy is immense support to radiologists and oncologists, since it gives them a very reliable preliminary assessment. Implemented in health systems, Random Forest improves early detection outcomes and thus the prognosis of patients, with better utilization of resources for health facilities.
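The evaluation snippet below uses rf_classifier, whose training code is not reproduced in this excerpt; a minimal sketch of how it would be fitted (the hyperparameters are assumptions):

from sklearn.ensemble import RandomForestClassifier

# train the Random Forest on the same split used for the other models
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
rf_classifier.fit(x_train, y_train)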
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
prediction6 = rf_classifier.predict(x_test)
print(confusion_matrix(y_test, prediction6))
accuracy = accuracy_score(y_test, prediction6)
precision = precision_score(y_test, prediction6)
recall = recall_score(y_test, prediction6)
f1 = f1_score(y_test, prediction6)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
Accuracy: 0.8932038834951457
Precision: 0.9042553191489362
Recall: 0.9770114942528736
F1 score: 0.9392265193370166
accuracy_score(y_test, prediction6)
probs = rf_classifier.predict_proba(x_test)   # class probabilities from the Random Forest model
precision_score(y_test, prediction6, average=None)
recall_score(y_test, prediction6, average=None)
f1_score(y_test, prediction6, average=None)
INPUT
RESULT
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
lung_data = pd.read_csv("survey lung cancer.csv")
lung_data.head()
lung_data.tail()
# encode categorical columns before extracting features and target
lung_data.GENDER = lung_data.GENDER.map({"M": 1, "F": 2})
lung_data.LUNG_CANCER = lung_data.LUNG_CANCER.map({"YES": 1, "NO": 2})
x = lung_data.iloc[:, 0:-1]
print(x)
y = lung_data.iloc[:, -1]
print(y)
lung_data.shape
lung_data.isnull().sum()
lung_data.dtypes
lung_data.head()
lung_data.tail()
lung_data.describe()
lung_data.info()
# split into training and test sets before fitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Model1 = LogisticRegression()
Model1.fit(x_train, y_train)
prediction1 = Model1.predict(x_test)
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
print(confusion_matrix(y_test, prediction1))
accuracy = accuracy_score(y_test, prediction1)
precision = precision_score(y_test, prediction1)
recall = recall_score(y_test, prediction1)
f1 = f1_score(y_test, prediction1)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
0.8737864077669902
probs = Model1.predict_proba(x_test)
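# NOTE (assumption): prediction2 below is not defined anywhere in this excerpt; given the
# project's algorithm list it presumably comes from the K-Nearest Neighbours model.
# A minimal sketch that would produce it (n_neighbors is an illustrative choice):
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
prediction2 = knn.predict(x_test)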
precision_score(y_test, prediction2, average = None)
cm = confusion_matrix(y_true = y_test, y_pred = prediction2)
#plot_confusion_matrix(cm,level,title = "confusion_matrix")
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state = 0,criterion = "entropy")
tree.fit(x_train, y_train)
prediction3 = tree.predict(x_test)
print(confusion_matrix(y_test, prediction3))
from sklearn.metrics import precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, prediction3)
precision = precision_score(y_test, prediction3)
recall = recall_score(y_test, prediction3)
f1 = f1_score(y_test, prediction3)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
probs = tree.predict_proba(x_test)
precision_score(y_test, prediction3, average=None)
recall_score(y_test, prediction3, average=None)
f1_score(y_test, prediction3, average=None)
cm = confusion_matrix(y_true = y_test, y_pred = prediction3)
#plot_confusion_matrix(cm,level,title = "confusion_matrix")
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
#Support Vector Machine
from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
svm = OneVsRestClassifier(BaggingClassifier(SVC(C=10, kernel='rbf', random_state=9, probability=True), n_jobs=-1))
svm.fit(x_train, y_train)
prediction4 = svm.predict(x_test)
from sklearn.metrics import precision_score, recall_score, f1_score
# compute the metrics for the SVM predictions (y_test and prediction4 hold the actual and predicted labels)
accuracy = accuracy_score(y_test, prediction4)
precision = precision_score(y_test, prediction4)
recall = recall_score(y_test, prediction4)
f1 = f1_score(y_test, prediction4)
print("Accuracy:", accuracy)
print("Precision:", precision)
cm = confusion_matrix(y_true=y_test, y_pred=prediction4)
#plot_confusion_matrix(cm,level,title = "confusion_matrix")
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
import pickle
with open('model.pkl', 'wb') as file:
    pickle.dump(rf_classifier, file)
print("Recall:", recall)
print("F1 score:", f1)
accuracy_score(y_test,prediction4
from flask import Flask, request

app = Flask(__name__)

with open('C:/Users/ANIRUDHTV/Desktop/Lung_Cancer_Prediction/Lung_Cancer_Prediction_Using_Machine_Learning-main/Lung_Cancer_Prediction_Using_Machine_Learning-main/model.pkl', 'rb') as file:
    model = pickle.load(file)
@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        try:
            data = [
                float(request.form['gender']),
                float(request.form['age']),
                float(request.form['smoking']),
                float(request.form['yellow_fingers']),
                float(request.form['anxiety']),
                float(request.form['peer_pressure']),
                float(request.form['chronic_disease']),
                float(request.form['fatigue']),
                float(request.form['allergy']),
                float(request.form['wheezing']),
                float(request.form['alcohol_consuming']),
                float(request.form['coughing']),
                float(request.form['shortness_of_breath']),
                float(request.form['swallowing_difficulty']),
                float(request.form['chest_pain'])
            ]
            # the original excerpt is truncated here; the lines below complete the handler minimally
            prediction = model.predict([data])[0]
            result = "Cancer detected" if prediction == 1 else "No cancer detected"
            return result
        except Exception as error:
            # error logged and user notified, as described in the test cases
            return f"Error during prediction: {error}"
    return "Submit the lung cancer report data via the form (POST)."

if __name__ == '__main__':
    app.run(debug=True)
CHAPTER 7
SOFTWARE TESTING
Testing the developed software and its model in a machine learning-based system for lung cancer prediction is very important for it to work with accuracy, reliability, and efficiency. Testing itself divides, in the broadest sense, into unit, integration, and performance categories. Unit test cases should revolve around whether components such as data loading, preprocessing, model training, and evaluation work correctly. This comprises proper loading of the report dataset, checking whether all required data are present or missing, and checking that the right proportion of train/test splitting is made.
Integration tests should prove that all these individual parts of the system fit together well. One could write an end-to-end pipeline test and simulate everything: data loading, preprocessing, model training, and prediction. Every step must fit with the next, so that the whole workflow gives the expected results. Performance tests guarantee that model training and prediction run effectively and quickly. Such a test helps to pull out where a bottleneck or inefficiency lies, ensuring that unreasonable time is not spent on any one particular step.
Beyond this, it is through this type of testing procedure that systematic checks of the correctness and efficiency of the lung cancer detection methods are facilitated. Detecting and fixing errors extremely early means that this thorough testing approach helps make the model robust, so it works very well in real life. Since this will be added to a continuous integration and continuous deployment pipeline, any update or modification will re-validate the system and keep the reliability and accuracy one comes to expect from the model.
Types of Tests
1. Functional Testing
This is also known as black-box testing. It determines whether the software conforms to the specified requirements or specifications.
Integration Testing: checks whether different modules or services interact properly.
System Testing: performed on the complete system to check whether the system meets all specified requirements.
Alpha Testing: done by the development team.
Beta Testing: done by a few of the end users before rolling the product out into the market.
2. Non-Functional Testing
Non-functional testing covers performance, usability, reliability, and anything else that cannot fall under the umbrella of functional testing.
Performance Testing: this aims at measuring all aspects of the software's performance. Among these are:
Load Testing: places the system under the expected load and tests its performance.
3. Maintenance Testing
Maintenance testing is carried out once the software has been deployed, to check whether the software continues to work fine.
Regression Testing: added functionality is checked with this technique for any side effects on the existing functionality.
Retesting: carried out for the defects or bugs that were fixed, with the aim of confirming that the related issues are resolved.
Maintenance testing means the testing carried out when changes or modifications are introduced to the present state of the software, for example when making updates or patches.
4. Manual Testing
A human conducts the tests: the test cases are executed with no support from automation tools. In such a case one can rapidly perform exploratory, usability, and ad-hoc tests.
5. Exploratory Testing
Exploratory testing is a test process in which learning of the system, test case design, and test case execution are all interlinked. In exploratory testing, test design and checking are completed in parallel, with the tester working freely or with some basic knowledge.
Ad-Hoc Testing: informal and heuristic testing, performed without any test plan or test case documentation.
Test Case-1 | Cancer is predicted (positive) | Input: valid positive patient data | Expected output: '1' (Cancer detected - YES)
Test Case-2 | Cancer is predicted (negative) | Input: valid negative patient data | Expected output: '0' (Cancer detected - NO)
Test Case-4 | Handling missing input | Model loaded with no patient data in the input fields | Expected output: error logged, user notified
Test Case-5 | Unexpected error occurred during prediction | Input: patient's valid data | Expected output: error logged, user notified
A trained model will be used to make predictions for new, or test, data. This must then be verified using unit tests to make sure the prediction function performs well and produces results in a timely fashion and, more importantly, that the results are in the correct format. Writing and running unit test after unit test guarantees that each individual component is correct, giving a system that predicts correctly and reliably in a project about predicting lung cancer. A small unit-test sketch follows.
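A minimal unit-test sketch for the prediction step, assuming the model.pkl file produced in the implementation chapter and a 15-feature input row matching the survey columns (the sample values are illustrative):

import pickle
import unittest

class TestLungCancerPrediction(unittest.TestCase):
    def setUp(self):
        # load the pickled Random Forest model saved during implementation
        with open('model.pkl', 'rb') as file:
            self.model = pickle.load(file)

    def test_prediction_format(self):
        # one illustrative 15-feature row in the order of the survey columns
        sample = [[1, 60, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2]]
        prediction = self.model.predict(sample)
        self.assertEqual(len(prediction), 1)
        self.assertIn(prediction[0], (1, 2))  # YES is mapped to 1, NO to 2 in the dataset

if __name__ == '__main__':
    unittest.main()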
CHAPTER 8
CONCLUSION
This work designed a sophisticated system for the prediction of lung cancer that makes use of ML techniques, especially the Random Forest algorithm. The main goal of the work presented was therefore to build a model capable of processing the details in patient reports and ensuring the right identification of lung cancer. Another reason to choose RF is that it provides high accuracy for classification. The model was also chosen for its generalization capabilities: it can correctly process very large sets of high-dimensional data without overfitting, while ranking features by their importance.
One set of data was created from a vast collection of very heterogeneous patient records; the features presumed relevant for determining lung cancer include demographic information and previous medical history of diseases, symptoms, and test results. The work also builds, to a small extent, knowledge and skills in data preprocessing: replacing missing values, rescaling the range of features, and encoding categorical placeholders so that the data can afterwards be used to train a machine learning model. A random forest model is trained on one part of the dataset, and performance is then tested on another part of the dataset.
We verified this method with respect to major metrics like accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve. The model thereby turns out to be quite accurate in classification and hence achieves very good performance in distinguishing non-cancerous from cancerous cases. These metrics were balanced within the F1-score, pointing towards a reliable model. The AUC-ROC is a graph of diagnostic ability at multiple threshold settings, and performance increases with the area under the curve. A short sketch of this curve computation follows.
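A hedged sketch of computing this curve for the Random Forest model, assuming the rf_classifier and test split from the implementation chapter (the positive label is taken as 1, matching the YES mapping used there):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# probability of the positive (cancer) class; column 0 assumes classes_ == [1, 2]
probs = rf_classifier.predict_proba(x_test)[:, 0]
fpr, tpr, thresholds = roc_curve(y_test, probs, pos_label=1)
print("AUC:", roc_auc_score(y_test == 1, probs))

plt.plot(fpr, tpr, label="Random Forest")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()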
In at least one scenario, to the best of the writer's knowledge, Random Forest clearly demonstrates its applicability in the current project, making it a potentially good algorithm to serve as an aiding tool in lung cancer diagnosis. Available statistics and literature suggest that early lung cancer diagnosis mostly allows good patient outcomes, in most instances owing to timely intervention and treatment. On that note, such automation will enable health professionals to make better diagnoses with a maximum level of correctness and speed, and hence give better patient care. The impact goes beyond the immediate effect to the lakhs of people whose diagnostic procedures change once the random forest machine learning model is taken into clinical practice. Conventional diagnostic techniques are more time-consuming and manual in medical data analysis, and naturally more prone to errors. These machine learning methods interpret large amounts of information to present reliable diagnostic support.
This ranking of variables is vital for the model, and more so for appreciating the variables that feature strongly in the observation of lung cancer. The information can further be utilised in finding intrinsic causes and risk factors in a disease process, improving strategies for prevention, and building better and more individualized treatment planning. Its shortcomings and challenges are these: it depends very heavily on good class labels and comprehensive training data, and model performance and generalizability can be greatly compromised when the model is built from incomplete or biased data. Random Forest models are pretty good for accuracy; however, they can turn out very hard to read and complex. Longitudinal studies on how the method operates over time, and on its effects on patient outcomes, would further help solidify meaningful feedback, which is just about necessary for further improvement.
It is pretty much a no-brainer that the Random Forest algorithm used in creating a system for the prediction of lung cancer represents gigantic strides in machine learning for the betterment of humankind's health. That is, it shows the large potential of ML methods in the very early and accurate detection of lung cancer, which is very critical to patient outcomes. Models such as this are going to bring a revolution in clinical practice by increasing the speed of operation and the accuracy, with a low level of human error. Notwithstanding the challenges, the project paves the way for further research and development in using these technologies, with promising results coming up.
As these technologies advance further and expand their reach, they may well construct a future in which machine learning surpasses human ability in diagnosis, treatment, and all care. This project hereby holds views that herald a paradigm shift at the wedding of medical knowledge with sophisticated computational techniques. It heralds the beginning of innovation in a new avenue of medical diagnosis. This research work, developing a lung cancer prediction system using a Random Forest model, is another critical step toward changing the delivery of healthcare services and improving the resulting outcomes for patients.
CHAPTER 9
FUTURE ENHANCEMENT
The project worked on developing a very refined system for machine learning, particularly around the Random Forest algorithm. The aim is to devise a model that can run on patient reports and their minute details in order to detect accurately whether a person is suffering from lung cancer or not. One of the reasons for choosing Random Forest is that this algorithm is tolerant, quite correct on classification tasks, loses little effectiveness as the dimensionality of the data increases, and is resistant to overfitting. It also ranks features by their importance.
Methodology and Implementation: The first step was data collection, yielding one dataset with high diversity in patient reports and features relevant to the identification of lung cancer. The independent characteristics were extracted from demographic data, past medical reports, symptoms, and results of diagnostic tests. Later on, data preprocessing is carried out, handling missing values, normalizing feature scales, and encoding categorical variables so that the data feeds into a machine learning model. A Random Forest model will then be trained on one part of the dataset; another part is kept aside for testing. A small preprocessing sketch follows.
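A hedged sketch of such a preprocessing pipeline, assuming the raw survey CSV used in the implementation chapter (the imputation and scaling choices are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# encode categorical variables, impute missing values, and scale feature ranges
raw_data = pd.read_csv("survey lung cancer.csv")
raw_data["GENDER"] = raw_data["GENDER"].map({"M": 1, "F": 2})
raw_data["LUNG_CANCER"] = raw_data["LUNG_CANCER"].map({"YES": 1, "NO": 2})

features = raw_data.drop(columns=["LUNG_CANCER"])
labels = raw_data["LUNG_CANCER"]

features = SimpleImputer(strategy="most_frequent").fit_transform(features)
features = StandardScaler().fit_transform(features)

# keep one part of the dataset for training and another for testing
x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)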
For the random forest model, performance assessment incorporates accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve. Results gave high accuracy for the model, leading to good differentiation between cancer and non-cancer cases. This is reflected in the balance of the metrics and in the reliability of the model expressed by the F1-score. One of the plots produced was the AUC-ROC curve, a diagnostic plot of how well the method does at independent threshold settings; the larger the area under the curve, the better the model performance.
Interpretation of Results:
Clearly, the success of the project working with the Random Forest algorithm further cements its position as one of the important tools in lung cancer detection. Early detection of lung cancer, to a large extent, improves the outcome for the sufferer, since timely interventions and treatment become possible. Although random forest models are comparatively accurate, they often become very complex and hard to interpret. Indeed, one of the more crucial considerations is understanding how the method makes its decisions, especially in an area like medicine where transparency comes first. Even though a random forest avoids most cases of overfitting, overfitting can still occur. Validation on held-out data increases confidence that the model neither overfits nor underfits and that it generalizes well to new, unseen data.