SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY (DEEMED TO BE UNIVERSITY)
BONAFIDE CERTIFICATE
Dr. Subhashini, M.E., Ph.D.,
DECLARATION
DETECTION done by me under the guidance of Dr. Ajitha (Internal) at SATHYABAMA, submitted in partial fulfillment of the requirements for the award of the Bachelor of Engineering / Technology degree in INFORMATION TECHNOLOGY.
DATE:
ACKNOWLEDGEMENT
I convey my thanks to Dr. SASIKALA, M.E., Ph.D., Dean, School of Computing, and Dr. Subashini, Head of the Department, Department of Information Technology, for providing me the necessary support and details at the right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide, Dr. P. Ajitha, for her valuable guidance, suggestions and constant encouragement that paved the way for the successful completion of my project work.
I wish to express my thanks to all teaching and non-teaching staff members of the Department of INFORMATION TECHNOLOGY who were helpful in many ways for the completion of the project.
ABSTRACT:
A credit card is issued by a bank or financial services company and allows the cardholder to borrow funds with which to pay for goods and services at merchants that accept cards. As transactions increasingly move online, there is a growing risk that cards will be misused and that account holders will lose money, so it is vital that credit card companies are able to identify fraudulent credit card transactions and ensure that customers are not charged for items they did not purchase. This type of problem can be addressed through data science by applying machine learning techniques. This work deals with modelling the credit card dataset using machine learning for credit card fraud detection. In machine learning the key is the data, so past credit card transactions, including those that turned out to be fraudulent, are used to build the model. The built model is then used to recognize whether a new transaction is fraudulent or not; the objective is to classify whether fraud has happened. The first step involves analyzing and pre-processing the data, then applying machine learning algorithms to the credit card dataset, finding the parameters of each algorithm and calculating its performance metrics.
TABLE OF CONTENTS
1.1.2 Artificial Intelligence 13
1.2 Objectives 17
2. EXISTING SYSTEM
2.1 Disadvantages 18
2.2 Proposed system 18
2.2.1 Advantages 19
2.3 Literature survey 20
3. FEASIBILITY STUDY 23
4. PROJECT REQUIREMENTS 23
4.1 General 25
4.2.2 Conda 30
4.2.4 Python 35
5. SYSTEM DIAGRAMS
5.1. SYSTEM ARCHITECTURE 45
5.7. ER – DIAGRAM 51
6. LIST OF MODULES 52
6.1 Module description 52
6.1.1 Data pre-processing 52
6.1.2 Data validation 55
6.1.3 Exploration data analysis of visualization 55
6.2 Algorithm and Techniques 62
6.2.1 Algorithm explanation 62
6.2.2 Logistic regression 63
6.2.3 Random forest classifier 66
6.2.4 Decision tree classifier 68
6.2.5 Naive bayes algorithm 71
6.3 Deployment 73
6.3.1 Flask web 73
6.3.2 Features 75
6.3.3 Advantages of Flask 77
6.4. HTML 81
6.5. CSS 84
7. CODING 90
8. CONCLUSION 112
8.1 Future work 113
9. REFERENCES 114
LIST OF SYMBOLS

NOTATION

1. Class: Represents a collection of similar entities grouped together. A class is drawn with its name, its attributes (+ public, - private, # protected) and its operations.
2. Association: Associations represent static relationships between classes. Roles represent the way the two classes see each other.
3. Actor: Interaction between the system and the external environment.
4. Aggregation: Aggregates several classes into a single class.
5. Relation (uses): Used for additional process communication.
6. Relation (extends): The extends relationship is used when one use case is similar to another use case but does a bit more.
7. Communication: Communication between various use cases.
8. State: State of the process.
12. Decision box: Represents a decision-making process based on a constraint.
14. Component: Represents physical modules, which are collections of components.
15. Node: Represents physical modules, which are collections of components.
16. Data Process/State: A circle in a DFD represents a state or process which has been triggered due to some event or action.
17. External entity: Represents external entities such as keyboards, sensors, etc.
18. Transition: Represents communication that occurs between processes.
1. INTRODUCTION
Domain overview
The term "data science" has been traced back to 1974, when Peter
Naur proposed it as an alternative name for computer science. In 1996, the
International Federation of Classification Societies became the first conference to
specifically feature data science as a topic. However, the definition was still in flux.
The term "data science" was first coined in 2008 by D.J. Patil and Jeff
Hammerbacher, the pioneer leads of data and analytics efforts at LinkedIn and
Facebook. In less than a decade, it has become one of the hottest and most trending
professions in the market.
Data Scientist:
Data scientists examine which questions need answering and where to find
the related data. They have business acumen and analytical skills as well as the
ability to mine, clean, and present data. Businesses use data scientists to source,
manage, and analyze large amounts of unstructured data.
highest level in strategic game systems (such as chess and Go). As machines
become increasingly capable, tasks considered to require "intelligence" are often
removed from the definition of AI, a phenomenon known as the AI effect. For
instance, optical character recognition is frequently excluded from things considered
to be AI, having become a routine technology.
The various sub-fields of AI research are centered around particular goals and
the use of particular tools. The traditional goals of AI research
include reasoning, knowledge representation, planning, learning, natural language
processing, perception and the ability to move and manipulate objects. General
intelligence (the ability to solve an arbitrary problem) is among the field's long-term
goals. To solve these problems, AI researchers use versions of search and
mathematical optimization, formal logic, artificial neural networks, and methods
based on statistics, probability and economics. AI also draws upon computer
science, psychology, linguistics, philosophy, and many other fields.
The field was founded on the assumption that human intelligence "can be so
precisely described that a machine can be made to simulate it". This raises
philosophical arguments about the mind and the ethics of creating artificial beings
endowed with human-like intelligence. These issues have been explored
by myth, fiction and philosophy since antiquity. Science fiction and futurology have
also suggested that, with its enormous potential and power, AI may become
an existential risk to humanity.
As the hype around AI has accelerated, vendors have been scrambling to
promote how their products and services use AI. Often what they refer to as AI is
simply one component of AI, such as machine learning. AI requires a foundation of
specialized hardware and software for writing and training machine learning
algorithms. No one programming language is synonymous with AI, but a few,
including Python, R and Java, are popular.
relevant fields are filled in properly, AI tools often complete jobs quickly and with
relatively few errors.
Natural language processing (NLP) allows machines to read and understand human
language. A sufficiently powerful natural language processing system would
enable natural-language user interfaces and the acquisition of knowledge directly
from human-written sources, such as newswire texts. Some straightforward
applications of natural language processing include information retrieval, text
mining, question answering and machine translation. Many current approaches use
word co-occurrence frequencies to construct syntactic representations of text.
"Keyword spotting" strategies for search are popular and scalable but dumb; a
search query for "dog" might only match documents with the literal word "dog" and
miss a document with the word "poodle". "Lexical affinity" strategies use the
occurrence of words such as "accident" to assess the sentiment of a document.
Modern statistical NLP approaches can combine all these strategies as well as
others, and often achieve acceptable accuracy at the page or paragraph level.
Beyond semantic NLP, the ultimate goal of "narrative" NLP is to embody a full
understanding of commonsense reasoning. By 2019, transformer-based deep
learning architectures could generate coherent text.
1.1.4 MACHINE LEARNING
Machine learning aims to predict the future from past data. Machine learning (ML) is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can change when exposed to new data; this section covers the basics of machine learning and the implementation of a simple machine learning algorithm using Python. The process of training and prediction involves the use of specialized algorithms: the training data is fed to an algorithm, and the algorithm uses this training data to make predictions on new test data. Machine learning can be roughly separated into three categories: supervised learning, unsupervised learning and reinforcement learning. In supervised learning the program is given both the input data and the corresponding labels, and the data has to be labeled by a human being beforehand. In unsupervised learning no labels are provided to the learning algorithm, which has to figure out the clustering of the input data on its own. Finally, reinforcement learning dynamically interacts with its environment and receives positive or negative feedback to improve its performance.
Data scientists use many different kinds of machine learning algorithms to discover patterns in data that lead to actionable insights. At a high level, these algorithms can be classified into two groups based on the way they "learn" about data to make predictions: supervised and unsupervised learning. Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification predictive modeling is the task of approximating a mapping function from input variables (X) to discrete output variables (y). In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the input data given to it and then uses this learning to classify new observations. The data set may simply be bi-class (for example, identifying whether a person is male or female, or whether an email is spam or not) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification and document classification.
The majority of practical machine learning uses supervised learning. In supervised learning there are input variables (X) and an output variable (y), and an algorithm is used to learn the mapping function from the input to the output, y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (y) for that data. Techniques of supervised machine learning include logistic regression, multi-class classification, decision trees and support vector machines. Supervised learning requires that the data used to train the algorithm is already labeled with correct answers. Supervised learning problems can be further grouped into regression and classification problems. The goal of such a problem is the construction of a succinct model that can predict the value of the dependent attribute from the attribute variables; the difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification. A classification model attempts to draw some conclusion from observed values: given one or more inputs, it will try to predict the value of one or more outcomes. A classification problem is when the output variable is a category, such as "red" or "blue".
1.2. OBJECTIVES
The goal is to develop a machine learning model for credit card fraud prediction that can potentially replace the updatable supervised machine learning classification models, predicting results in the form of the best accuracy obtained by comparing supervised algorithms.
Exploration data analysis of variable identification
Loading the given dataset
Importing the required library packages
Analyzing the general properties
Finding duplicate and missing values
Checking unique and count values
Uni-variate data analysis
Renaming, adding and dropping data
Specifying the data types
Exploration data analysis of bi-variate and multi-variate data
Plotting pair plots, heatmaps, bar charts and histograms
Outlier detection with feature engineering
Pre-processing the given dataset
Splitting the dataset into training and test sets
Comparing the Decision tree, Logistic regression and Random forest models
Comparing the algorithms and predicting the result based on the best accuracy
A short code sketch of these steps is given below.
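The following is a minimal pandas sketch of the exploration steps listed above. The file name "creditcard.csv" and the column names "Class" and "Time" are assumptions for illustration, not fixed by this report; they should be replaced with the actual dataset details.

# Minimal sketch of the exploration steps listed above (assumed file/column names).
import pandas as pd

df = pd.read_csv("creditcard.csv")            # loading the given dataset

print(df.shape)                                # general properties of the data frame
print(df.dtypes)                               # data type of each column
print(df.duplicated().sum())                   # number of duplicate rows
print(df.isnull().sum())                       # missing values per column
print(df["Class"].value_counts())              # unique / count values of the label

df = df.rename(columns={"Time": "time_sec"})   # renaming a column
df = df.drop_duplicates()                      # dropping duplicate rows
df["Class"] = df["Class"].astype(int)          # specifying the data type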
1.2.2 SCOPE OF THE PROJECT
The main scope is to detect fraud, which is a classic classification problem, with the help of machine learning algorithms. A model is needed that can differentiate between fraudulent and genuine transactions.
2. EXISTING SYSTEM:
They proposed a method named the Information-Utilization-Method (INUM). It was first designed, and the accuracy and convergence of an information vector generated by INUM were analyzed. The novelty of INUM is illustrated by comparing it with other methods. Two D-vectors (i.e., feature subsets) a and b, where Ai is the i-th feature in a data set, are dissimilar in decision space but correspond to the same O-vector y in objective space. Assume that only a is provided to decision-makers, but a becomes inapplicable due to an accident or other reasons (e.g., difficulty of extraction from the data set); then the decision-makers are in trouble. On the other hand, if both feature subsets are provided to them, they have other choices to serve their best interest. In other words, obtaining more equivalent D-vectors in the decision space provides more chances for decision-makers to ensure that their interests are best served. Therefore, it is of great significance and importance to solve MMOPs with a good Pareto front approximation and also the largest number of D-vectors for each O-vector.
2.1. DISADVANTAGES:
1. They proposed a mathematical model, and machine learning algorithms were not used.
2. The class imbalance problem was not addressed and proper measures were not taken.
2.2.1 ADVANTAGES:
(Figure: proposed system overview, in which the bank dataset provides a training dataset that a classification ML algorithm uses to build the model.)
General
A literature review is a body of text that aims to review the critical points of current knowledge on, and/or methodological approaches to, a particular topic. Literature reviews are secondary sources: they discuss published information in a particular subject area, sometimes restricted to a certain time period. The ultimate goal is to bring the reader up to date with the current literature on a topic, and it forms the basis for another goal, such as identifying future research that may be needed in the area; it precedes a research proposal and may be just a simple summary of sources. Usually it has an organizational pattern and combines both summary and synthesis.
A summary is a recap of important information about the source, but a
synthesis is a re-organization, reshuffling of information. It might give a new
interpretation of old material or combine new with old interpretations or it might trace
the intellectual progression of the field, including major debates. Depending on the
situation, the literature review may evaluate the sources and advise the reader on
the most pertinent or relevant of them.
Review of Literature Survey
Companies want to give more and more facilities to their customers, and one of these facilities is the online mode of buying goods. Customers can now buy the required goods online, but this is also an opportunity for criminals to commit fraud. Criminals can steal the information of any cardholder and use it for online purchases until the cardholder contacts the bank to block the card. This paper shows the different machine learning algorithms that are used for detecting this kind of transaction. The research shows that credit card fraud (CCF) is a major issue of the financial sector that is increasing with the passage of time. More and more companies are moving towards the online mode that allows customers to make online transactions, which is an opportunity for criminals to steal the information or cards of other persons to make online transactions. The most popular techniques used to steal credit card information are phishing and Trojans, so a fraud detection system is needed to detect such activities.
Nowadays the credit card is popular among private and public employees. Using a credit card, users purchase consumable and durable products online and also transfer amounts from one account to another. Fraudsters obtain the details of a user's transaction behaviour and perform illegal activities with the card through phishing, Trojan viruses, etc., and may threaten users over their sensitive information. In this paper, we have discussed various methods of detecting and controlling fraudulent activities, which will be helpful in improving the security of card transactions in the future. Credit card fraud is one of the major issues faced by people; due to these fraudulent activities, many credit card users lose their money and their sensitive information. In this paper, we have discussed the different fraud detection and controlling techniques for credit cards, which will also be helpful in improving security against fraudsters and avoiding illegal activities in the future.
Year : 2021
Nowadays, the credit card transaction is one of the most common modes of financial transaction. The increasing trend of financial transactions through credit cards also invites fraud activities that involve the loss of billions of dollars globally. It has also been observed that fraudulent transactions have increased by 35% since 2018. A huge amount of transaction data is available for analysing fraud detection activities, which requires analysis of behaviour and abnormalities in the transaction dataset in order to detect and block the undesirable actions of suspected persons. The proposed paper gives a comprehensive summary of various techniques for the classification of fraud transactions from various datasets, in order to alert the user to such transactions. In the last decades, online transactions have grown rapidly and become the most common tool for financial transactions, and this growth also increases threats. Therefore, keeping in mind the security issues and the anomalous nature of credit card transactions, the proposed work presents a summary of various strategies applied to identify abnormal transactions in credit card transaction datasets. Such a dataset contains a mix of normal and fraud transactions; the proposed work classifies and summarizes the various classification methods used to classify the transactions with machine-learning-based classifiers. The efficiency of each method depends on the dataset and the classifier used. The proposed summary will be beneficial to bankers, credit card users and researchers in analysing and preventing credit card fraud. The future scope of this credit card fraud detection work is to explore these approaches in every association and bank so that people can live safe and happy lives; the data must be balanced in each place so that the best results are obtained.
Title : A Review On Credit Card Fraud Detection Using Machine Learning
Author: Suresh K Shirgave, Chetan J. Awati, Rashmi More, Sonam S. Patil
Year : 2019
In recent years, credit card fraud has become one of the growing problems. Large financial losses have greatly affected individuals using credit cards as well as merchants and banks. Machine learning is considered one of the most successful techniques to identify fraud. This paper reviews different fraud detection techniques using machine learning and compares them using performance measures such as accuracy, precision and specificity. The paper also proposes a fraud detection system (FDS) which uses a supervised Random Forest algorithm; with the proposed system the accuracy of detecting fraud in credit cards is increased. Further, the proposed system uses a learning-to-rank approach to rank the alerts and also effectively addresses the problem of concept drift in fraud detection. The paper has reviewed various machine learning algorithms for detecting fraud in credit card transactions, and the performance of all these techniques is examined based on accuracy, precision and specificity metrics. We have selected the supervised learning technique Random Forest to classify the alerts as fraudulent or authorized. This classifier is trained using feedback and delayed supervised samples; next it aggregates each probability to detect alerts. Further, we proposed a learning-to-rank approach where alerts are ranked based on priority. The suggested method will be able to solve the class imbalance and concept drift problems. Future work will include applying semi-supervised learning methods for the classification of alerts in the FDS.
Title : Credit Card Fraud Detection and Prevention using Machine Learning
Author: S. Abinayaa, H. Sangeetha, R. A. Karthikeyan, K. Saran Sriram, D. Piyush
Year : 2020
This research focused mainly on detecting credit card fraud in the real world. The credit card data sets must be collected initially to qualify a data set, and then queries are provided on the user's credit card to test the data set. After classification with the random forest algorithm using the already evaluated data set and the current data set [1], the accuracy of the results is optimised. Then the processing of a number of attributes is implemented, so that the factors affecting fraud detection can be found by viewing the representation of the graphical model. The efficiency of the technique is measured based on accuracy, flexibility, specificity and precision. The results obtained with the use of the Random Forest algorithm have proved much more effective.
3. FEASIBILITY STUDY:
Data Wrangling
In this section of the report, the data is loaded, checked for cleanliness, and then trimmed and cleaned for analysis. The steps are documented carefully and the cleaning decisions are justified.
Data collection
The data set collected for prediction is split into a training set and a test set. Generally, a 7:3 ratio is applied to split the training set and test set. The data models created using the Random Forest, Logistic Regression and Decision Tree algorithms and the Support Vector Classifier (SVC) are applied on the training set, and based on the test result accuracy, the test set prediction is done.
Preprocessing
The data which was collected might contain missing values that may lead to inconsistency. To gain better results the data needs to be preprocessed so as to improve the efficiency of the algorithm: the outliers have to be removed and variable conversion needs to be done.
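The short sketch below illustrates one possible way to carry out the preprocessing just described, with median imputation for missing values and a simple IQR rule for outliers on an assumed "Amount" column; the exact rules used in the project may differ.

# Sketch of the preprocessing step: impute missing values and drop simple outliers.
df = df.fillna(df.median(numeric_only=True))     # fill numeric gaps with column medians

q1, q3 = df["Amount"].quantile([0.25, 0.75])     # "Amount" is an assumed column name
iqr = q3 - q1
inside_fence = df["Amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[inside_fence]                            # keep rows inside the IQR fence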
CONSTRUCTION OF A PREDICTIVE MODEL
Machine learning needs the gathering of a lot of past data. Data gathering should provide sufficient historical and raw data. Raw data cannot be used directly; it must be pre-processed before deciding what kind of algorithm and model to use. The model is then trained and tested to confirm that it works and predicts correctly with minimum errors, and the tuned model is improved from time to time to increase the accuracy.
(Figure: construction of the predictive model: data gathering → data pre-processing → choose model → train model → test model → tune model → prediction.)
4. PROJECT REQUIREMENTS
4.1. General:
Requirements are the basic constraints that are required to develop a system.
Requirements are collected while designing the system. The following are the
requirements that are to be discussed.
1. Functional requirements
2. Non-Functional requirements
3. Environment requirements
A. Hardware requirements
B. software requirements
4.1.2 Functional requirements:
1. Problem definition
2. Preparing data
3. Evaluating algorithms
4. Improving results
5. Predicting the result
1. Software Requirements:
2. Hardware requirements:
RAM : minimum 2 GB
4.3. SOFTWARE DESCRIPTION
others by uploading them to Anaconda Cloud, PyPI or other repositories. The default
installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7.
However, you can create new environments that include any version of Python
packaged with conda.
Anaconda. Now, if you are primarily doing data science work, Anaconda is
also a great option. Anaconda is created by Continuum Analytics, and it is a Python
distribution that comes preinstalled with lots of useful python libraries for data
science.
Anaconda is a distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.), that aims to simplify package management
and deployment.
JupyterLab
Jupyter Notebook
Spyder
PyCharm
VSCode
Glueviz
Orange 3 App
RStudio
Anaconda Prompt (Windows only)
Anaconda PowerShell (Windows only)
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution.
Navigator allows you to launch common Python programs and easily manage
conda packages, environments, and channels without using command-line
commands. Navigator can search for packages on Anaconda Cloud or in a local
Anaconda Repository.
Anaconda comes with many built-in packages that you can easily find with conda
list on your anaconda prompt. As it has lots of packages (many of which
are rarely used), it requires lots of space and time as well. If you have enough space,
time and do not want to burden yourself to install small utilities like JSON, YAML,
you better go for Anaconda.
4.3.2. CONDA:
This website acts as "meta" documentation for the Jupyter ecosystem. It has a
collection of resources to navigate the tools and communities in this ecosystem, and
to help you get started.
dozens of programming languages". It was spun off from IPython in 2014 by
Fernando Perez.
Launching Jupyter Notebook App: The Jupyter Notebook App can be launched by
clicking on the Jupyter Notebook icon installed by Anaconda in the start menu
(Windows) or by typing in a terminal (cmd on Windows): "jupyter notebook".
This will launch a new browser window (or a new tab) showing the Notebook
Dashboard, a sort of control panel that allows (among other things) to select which
notebook to open.
When started, the Jupyter Notebook App can access only files within its start-
up folder (including any sub-folder). No configuration is necessary if you place your
notebooks in your home folder or subfolders. Otherwise, you need to choose
a Jupyter Notebook App start-up folder which will contain all the notebooks.
Save notebooks: Modifications to the notebooks are automatically saved every few minutes. To avoid modifying the original notebook, make a copy of the notebook document (menu File -> Make a Copy…) and save the modifications on the copy.
Executing a notebook: Download the notebook you want to execute and put it in your notebook folder (or a sub-folder of it).
Launch the jupyter notebook app
Click on the menu Help -> User Interface Tour for an overview of the Jupyter
Notebook App user interface.
You can run the notebook document step by step (one cell at a time) by pressing Shift + Enter.
You can run the whole notebook in a single step by clicking on the menu Cell
-> Run All.
To restart the kernel (i.e. the computational engine), click on the menu Kernel
-> Restart. This can be useful to start over a computation from scratch (e.g.
variables are deleted, open files are closed, etc…).
PURPOSE: To support interactive data science and scientific computing across all
programming languages.
JUPYTER NOTEBOOK APP:
The Notebook Dashboard has other features similar to a file manager, namely navigating folders and renaming/deleting files.
WORKING PROCESS:
Download and install anaconda and get the most useful package for machine
learning in Python.
Load a dataset and understand its structure using statistical summaries and
data visualization.
Evaluate machine learning models, pick the best and build confidence that the accuracy is reliable.
Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and for developing production systems. There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.
The best way to get started using Python for machine learning is to complete a
project.
It will force you to install and start the Python interpreter (at the very least).
It will give you a bird's-eye view of how to step through a small project.
It will give you confidence, maybe to go on to your own small projects.
When you are applying machine learning to your own datasets, you are working
on a project. A machine learning project may not be linear, but it has a number of
well-known steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
The best way to really come to terms with a new platform or tool is to work
through a machine learning project end-to-end and cover the key steps. Namely,
from loading data, summarizing data, evaluating algorithms and making some
predictions.
4.3.4 PYTHON
INTRODUCTION:
not completely backward-compatible. Python 2 was discontinued with version 2.7.18
in 2020.
HISTORY:
Python was conceived in the late 1980s by Guido van Rossum at Centrum
Wiskunde & Informatica (CWI) in the Netherlands as a successor to ABC
programming language, which was inspired by SETL, capable of exception
handling and interfacing with the Amoeba operating system. Its implementation
began in December 1989. Van Rossum shouldered sole responsibility for the
project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the project's chief decision-maker. In January 2019, active Python core developers elected a five-member "Steering Council" to lead the project. As of 2021, the current members of this council are Barry Warsaw, Brett Cannon, Carol Willing, Thomas Wouters, and Pablo Galindo Salgado.
Python 2.0 was released on 16 October 2000, with many major new features,
including a cycle-detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the
language that is not completely backward-compatible. Many of its major features
were backported to Python 2.6.x and 2.7.x version series. Releases of Python 3
include the 2to3 utility, which automates (at least partially) the translation of
Python 2 code to Python 3.
Python 2.7's end-of-life date was initially set for 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3. No more security patches or other improvements will be released for it. With Python 2's end-of-life, only Python 3.6.x and later are supported.
Python 3.9.2 and 3.8.8 were expedited, as all versions of Python (including 2.7) had security issues leading to possible remote code execution and web cache poisoning.
Rather than having all of its functionality built into its core, Python was
designed to be highly extensible (with modules). This compact modularity has made
it particularly popular as a means of adding programmable interfaces to existing
applications. Van Rossum's vision of a small core language with a large standard
library and easily extensible interpreter stemmed from his frustrations with ABC,
which espoused the opposite approach.
Python strives for a simpler, less-cluttered syntax and grammar while giving
developers a choice in their coding methodology. In contrast to Perl's "there is more
than one way to do it" motto, Python embraces a "there should be one— and
preferably only one —obvious way to do it" design philosophy. Alex Martelli,
a Fellow at the Python Software Foundation and Python book author, writes that "To
describe something as 'clever' is not considered a compliment in the Python culture."
Python's developers aim to keep the language fun to use. This is reflected in
its name, a tribute to the British comedy group Monty Python, and in occasionally
playful approaches to tutorials and reference materials, such as examples that refer
to spam and eggs (a reference to a Monty Python sketch) instead of the
standard foo and bar.
SYNTAX AND SEMANTICS:
INDENTATION:
The try statement, which allows exceptions raised in its attached code block to
be caught and handled by except clauses; it also ensures that clean-up code in a
finally block will always be run regardless of how the block exits.
The raise statement, used to raise a specified exception or re-raise a caught
exception.
The class statement, which executes a block of code and attaches its local
namespace to a class, for use in object-oriented programming.
The def statement, which defines a function or method.
The with statement, which encloses a code block within a context manager (for
example, acquiring a lock before the block of code is run and releasing the lock
afterwards, or opening a file and then closing it), allowing resource-acquisition-is-
initialization (RAII) - like behavior and replaces a common try/finally idiom.
The break statement, which exits from a loop.
The continue statement, which skips the rest of the current iteration and continues with the next item.
The del statement, which removes a variable, meaning the reference from the name to the value is deleted and trying to use that variable will cause an error. A deleted variable can be reassigned.
The pass statement, which serves as a NOP. It is syntactically needed to create
an empty code block.
The assert statement, used during debugging to check for conditions that should
apply.
The yield statement, which returns a value from a generator function and yield is
also an operator. This form is used to implement co-routines.
The return statement, used to return a value from a function.
The import statement, which is used to import modules whose functions or variables can be used in the current program.
Several of these statements are illustrated in the short sketch below.
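The following small, self-contained snippet is an illustration only; it exercises try/except/finally, with, def, return, yield, pass, assert and import on a hypothetical configuration-reading helper.

# Illustration of several of the statements described above.
import json                        # the import statement

def read_config(path):             # the def statement
    try:                           # the try statement with except and finally clauses
        with open(path) as fh:     # the with statement (context manager)
            return json.load(fh)   # the return statement
    except FileNotFoundError:
        return {}                  # fall back to an empty configuration
    finally:
        pass                       # the pass statement: a syntactic no-op

def count_up(limit):
    n = 0
    while n < limit:
        yield n                    # the yield statement (generator function)
        n += 1

assert list(count_up(3)) == [0, 1, 2]   # the assert statement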
does not have a fixed data type associated with it. However, at a given time, a
variable will refer to some object, which will have a type. This is referred to
as dynamic typing and is contrasted with statically-typed programming languages,
where each variable may only contain values of a certain type.
Python does not support tail call optimization or first-class continuations, and,
according to Guido van Rossum, it never will.[80][81] However, better support for co-
routine-like functionality is provided, by extending Python's generators. Before 2.5,
generators were lazy iterators; information was passed uni-directionally out of the
generator. From Python 2.5, it is possible to pass information back into a generator
function, and from Python 3.3, the information can be passed through multiple stack
levels.
EXPRESSIONS:
Addition, subtraction, and multiplication are the same, but the behavior of division
differs. There are two types of divisions in Python. They are floor division (or
integer division) // and floating-point / division. Python also uses the ** operator
for exponentiation.
From Python 3.5, the new @ infix operator was introduced. It is intended to be
used by libraries such as NumPy for matrix multiplication.
From Python 3.8, the syntax :=, called the 'walrus operator' was introduced. It
assigns values to variables as part of a larger expression.
In Python, == compares by value, versus Java, which compares numerics by
value and objects by reference. (Value comparisons in Java on objects can be
performed with the equals() method.) Python's is operator may be used to
compare object identities (comparison by reference). In Python, comparisons
may be chained, for example A<=B<=C.
Python uses the words and, or, not for its boolean operators rather than the symbolic &&, ||, ! used in Java and C.
Python has a type of expression termed a list comprehension as well as a more
general expression termed a generator expression.
Anonymous functions are implemented using lambda expressions; however,
these are limited in that the body can only be one expression.
Conditional expressions in Python are written as x if c else y (different in order of
operands from the c ? x : y operator common to many other languages).
Python makes a distinction between lists and tuples. Lists are written as [1, 2, 3],
are mutable, and cannot be used as the keys of dictionaries (dictionary keys
must be immutable in Python). Tuples are written as (1, 2, 3), are immutable and
thus can be used as the keys of dictionaries, provided all elements of the tuple
are immutable. The + operator can be used to concatenate two tuples, which
does not directly modify their contents, but rather produces a new tuple
containing the elements of both provided tuples. Thus, given the variable t initially
equal to (1, 2, 3), executing t = t + (4, 5) first evaluates t + (4, 5), which yields (1,
2, 3, 4, 5), which is then assigned back to t, thereby effectively "modifying the
contents" of t, while conforming to the immutable nature of tuple objects.
Parentheses are optional for tuples in unambiguous contexts.
Python features sequence unpacking wherein multiple expressions, each
evaluating to anything that can be assigned to (a variable, a writable property,
etc.), are associated in an identical manner to that forming tuple literals and, as a
whole, are put on the left-hand side of the equal sign in an assignment
statement. The statement expects an iterable object on the right-hand side of the
equal sign that produces the same number of values as the provided writable
expressions when iterated through and will iterate through it, assigning each of
the produced values to the corresponding expression on the left.
Python has a "string format" operator %. This functions analogously to printf format strings in C, e.g. "spam=%s eggs=%d" % ("blah", 2) evaluates to "spam=blah eggs=2". In Python 3 and 2.6+, this was supplemented by the format() method of the str class, e.g. "spam={0} eggs={1}".format("blah", 2). Python 3.6 added "f-strings": blah = "blah"; eggs = 2; f"spam={blah} eggs={eggs}".
Strings in Python can be concatenated by "adding" them (with the same operator as for adding integers and floats), e.g. "spam" + "eggs" returns "spameggs". Even if your strings contain numbers, they are still added as strings rather than integers, e.g. "2" + "2" returns "22".
Python has various kinds of string literals:
o Strings delimited by single or double quote marks. Unlike in Unix
shells, Perl and Perl-influenced languages, single quote marks and double
quote marks function identically. Both kinds of string use the backslash (\) as
an escape character. String interpolation became available in Python 3.6 as
"formatted string literals".
o Triple-quoted strings, which begin and end with a series of three single or
double quote marks. They may span multiple lines and function like here
documents in shells, Perl and Ruby.
o Raw string varieties, denoted by prefixing the string literal with an r . Escape
sequences are not interpreted; hence raw strings are useful where literal
backslashes are common, such as regular expressions and Windows-style
paths. Compare "@-quoting" in C#.
Python has array index and array slicing expressions on lists, denoted as
a[Key], a[start:stop] or a[start:stop:step]. Indexes are zero-based, and negative
indexes are relative to the end. Slices take elements from the start index up to,
but not including, the stop index. The third slice parameter, called step or stride,
allows elements to be skipped and reversed. Slice indexes may be omitted, for
example a[:] returns a copy of the entire list. Each element of a slice is a shallow copy. A few of these expression forms are demonstrated in the sketch below.
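The short snippet below is purely illustrative and simply exercises several of the expression forms just described: division operators, chained comparison, list comprehensions, tuple concatenation, f-strings and slicing.

# Illustration of the expression forms described above.
print(7 // 2, 7 / 2, 2 ** 10)        # floor division, true division, exponentiation
print(1 <= 2 <= 3)                   # chained comparison -> True

squares = [n * n for n in range(5)]  # list comprehension
t = (1, 2, 3)
t = t + (4, 5)                       # builds a new tuple; the original stays immutable

blah, eggs = "blah", 2
print(f"spam={blah} eggs={eggs}")    # f-string formatting (Python 3.6+)

a = [0, 1, 2, 3, 4, 5]
print(a[1:4], a[::-1], a[:])         # slicing: sub-list, reversed copy, full copy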
Statements cannot be a part of an expression, so list and other
comprehensions or lambda expressions, all being expressions, cannot contain
statements. A particular case of this is that an assignment statement such as a=1
cannot form part of the conditional expression of a conditional statement. This has
the advantage of avoiding a classic C error of mistaking an assignment operator =
for an equality operator == in conditions: if (c==1) {…} is syntactically valid (but
probably unintended) C code but if c=1: … causes a syntax error in Python.
METHODS:
Methods on objects are functions attached to the object's class; the syntax
instance.method(argument) is, for normal methods and functions, syntactic sugar for
Class.method(instance, argument). Python methods have an explicit self parameter to access instance data, in contrast to the implicit self (or this) in some other object-
oriented programming languages (e.g., C++, Java, Objective-C, or Ruby). Apart from
this Python also provides methods, sometimes called d-under methods due to their
names beginning and ending with double-underscores, to extend the functionality of
custom class to support native functions such as print, length, comparison, support
for arithmetic operations, type conversion, and many more.
TYPING:
Python uses duck typing and has typed objects but untyped variable names.
Type constraints are not checked at compile time; rather, operations on an object
may fail, signifying that the given object is not of a suitable type. Despite being
dynamically-typed, Python is strongly-typed, forbidding operations that are not well-
defined (for example, adding a number to a string) rather than silently attempting to
make sense of them.
Python allows programmers to define their own types using classes, which
are most often used for object-oriented programming. New instances of classes are
constructed by calling the class (for example, SpamClass() or EggsClass()), and the
classes are instances of the metaclass type (itself an instance of itself), allowing
meta-programming and reflection.
Before version 3.0, Python had two kinds of classes: old-style and new-style. The syntax of both styles is the same, the difference being whether the class
object is inherited from, directly or indirectly (all new-style classes inherit from object
and are instances of type). In versions of Python 2 from Python 2.2 onwards, both
kinds of classes can be used. Old-style classes were eliminated in Python 3.0.
The long-term plan is to support gradual typing and from Python 3.5, the syntax of
the language allows specifying static types but they are not checked in the default
implementation, CPython. An experimental optional static type checker
named mypy supports compile-time type checking.
5. SYSTEM DIAGRAMS
5.2 WORK FLOW DIAGRAM
(Workflow Diagram: the source data is split into training and testing datasets, classification ML algorithms are applied, and the best model is selected by accuracy.)
Use case diagrams are considered for high-level requirement analysis of a system. When the requirements of a system are analyzed, the functionalities are captured in use cases. So it can be said that use cases are nothing but the system functionalities written in an organized manner.
5.4 CLASS DIAGRAM:
A class diagram is basically a graphical representation of the static view of the system and represents different aspects of the application, so a collection of class diagrams represents the whole system. The name of the class diagram should be meaningful and describe the aspect of the system. Each element and their relationships should be identified in advance, and the responsibility (attributes and methods) of each class should be clearly identified. For each class, the minimum number of properties should be specified, because unnecessary properties will make the diagram complicated. Use notes whenever required to describe some aspect of the diagram, and at the end of the drawing it should be understandable to the developer/coder. Finally, before making the final version, the diagram should be drawn on plain paper and reworked as many times as possible to make it correct.
5.5 ACTIVITY DIAGRAM:
An activity is a particular operation of the system. Activity diagrams are not only used for visualizing the dynamic nature of a system; they are also used to construct the executable system by using forward and reverse engineering techniques. The only thing missing in an activity diagram is the message part: it does not show any message flow from one activity to another. An activity diagram is sometimes considered a flow chart; although it looks like a flow chart, it is not. It shows different flows such as parallel, branched, concurrent and single.
5.6 SEQUENCE DIAGRAM:
Sequence diagrams model the flow of logic within your system in a visual
manner, enabling you both to document and validate your logic, and are commonly
used for both analysis and design purposes. Sequence diagrams are the most
popular UML artifact for dynamic modelling, which focuses on identifying the
behaviour within your system. Other dynamic modelling techniques include activity
diagramming, communication diagramming, timing diagramming, and interaction
overview diagramming. Sequence diagrams, along with class
diagrams and physical data models are in my opinion the most important design-
level models for modern business application development.
5.7 ENTITY RELATIONSHIP DIAGRAM (ERD)
6. LIST OF MODULES:
Data Pre-processing
Data Analysis of Visualization
Comparing Algorithm with prediction in the form of best accuracy result
Deployment Using Flask
Validation techniques in machine learning are used to get the error rate of the machine learning (ML) model, which can be considered close to the true error rate of the dataset. If the data volume is large enough to be representative of the population, validation techniques may not be needed; however, in real-world scenarios we work with samples of data that may not be a true representative of the population of the given dataset. Validation involves finding missing values and duplicate values and describing the data types, i.e. whether a variable is a float or an integer. A sample of data is used to provide an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters.
A number of different data cleaning tasks are carried out using Python's Pandas library, focusing on probably the biggest data cleaning task: missing values. Handling these well makes it possible to clean data more quickly, spend less time cleaning, and more time exploring and modeling.
Some of these sources are just simple random mistakes; other times there can be a deeper reason why data is missing. It is important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how the missing values are detected and filled in, from basic imputation to more detailed statistical approaches. Before jumping into code, it is important to understand the sources of missing data. Here are some typical reasons why data is missing:
Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.
Import libraries for access and functional purposes and read the given dataset
Analyze the general properties of the given dataset
Display the given dataset in the form of a data frame
Show the columns
Show the shape of the data frame
Describe the data frame
Check data types and information about the dataset
Check for duplicate data
Check missing values of the data frame
Check unique values of the data frame
Check count values of the data frame
Rename and drop columns of the given data frame
Specify the type of values
Create extra columns
A pandas sketch of these checks is given below.
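The following is a minimal pandas sketch of the checks listed above. The file name "creditcard.csv" and the column name "Amount" are assumptions used only for illustration.

# Sketch of the validation checks listed above (assumed file/column names).
import numpy as np
import pandas as pd

df = pd.read_csv("creditcard.csv")

df.info()                                  # data types and non-null counts
print(df.describe())                       # summary statistics of the data frame
print(df.columns.tolist(), df.shape)       # columns and shape
print(df.duplicated().sum())               # duplicate rows
print(df.isnull().sum())                   # missing values per column
print(df.nunique())                        # unique values per column
print(df["Amount"].value_counts().head())  # count values of one column

df["log_amount"] = np.log1p(df["Amount"])  # example of creating an extra column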
MODULE DIAGRAM
6.1.2. DATA VALIDATION/ CLEANING/PREPARING PROCESS
After importing the library packages and loading the given dataset, variable identification is analyzed through the data shape and data types, and missing and duplicate values are evaluated. A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters; there are procedures that you can use to make the best use of validation and test datasets when evaluating your models. Data cleaning and preparing involve renaming columns of the given dataset, dropping columns, etc., before analyzing the uni-variate, bi-variate and multi-variate processes. The steps and techniques for data cleaning will vary from dataset to dataset. The primary goal of data cleaning is to detect and remove errors and anomalies to increase the value of data in analytics and decision making.
Sometimes data does not make sense until it is viewed in a visual form, such as charts and plots. Being able to quickly visualize data samples is an important skill both in applied statistics and in applied machine learning. This module covers the many types of plots needed when visualizing data in Python and how to use them to better understand your own data:
How to chart time series data with line plots and categorical quantities with bar charts.
How to summarize data distributions with histograms and box plots.
A short plotting sketch of these ideas follows.
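The sketch below assumes the pandas DataFrame df with "Amount" and "Class" columns from the earlier sketches, and that matplotlib and seaborn are installed; these are assumptions for illustration rather than the project's actual plotting code.

# Sketch of the visualisations mentioned above.
import matplotlib.pyplot as plt
import seaborn as sns

df["Amount"].plot(kind="hist", bins=50, title="Transaction amount distribution")
plt.show()

df["Class"].value_counts().plot(kind="bar", title="Class balance")
plt.show()

sns.boxplot(x="Class", y="Amount", data=df)   # box plot of amount per class
plt.show()

sns.heatmap(df.corr(), cmap="coolwarm")       # correlation heatmap of numeric columns
plt.show()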
MODULE DIAGRAM
FALSE POSITIVES (FP): A person who will pay is predicted as a defaulter. The actual class is no and the predicted class is yes, e.g. the actual class says this passenger did not survive but the predicted class tells you that this passenger will survive.
FALSE NEGATIVES (FN): A person who defaults is predicted as a payer. The actual class is yes but the predicted class is no, e.g. the actual class value indicates that this passenger survived but the predicted class tells you that the passenger will die.
TRUE POSITIVES (TP): A person who will not pay is predicted as a defaulter. These are the correctly predicted positive values, which means that the value of the actual class is yes and the value of the predicted class is also yes, e.g. the actual class value indicates that this passenger survived and the predicted class tells you the same thing.
TRUE NEGATIVES (TN): A person who will pay is predicted as a payer. These are the correctly predicted negative values, which means that the value of the actual class is no and the value of the predicted class is also no, e.g. the actual class says this passenger did not survive and the predicted class tells you the same thing.
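The four quantities above can be read directly from a confusion matrix. The sketch below assumes that y_test and y_pred (the model's predictions on the test set) already exist from an earlier fit; it is illustrative only.

# Sketch: extracting TP, TN, FP, FN from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")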
The next section shows exactly how this can be done in Python with scikit-learn. The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data; this can be achieved by forcing each algorithm to be evaluated on a consistent test harness, as in the sketch after the following list.
Logistic Regression
Random Forest
Decision Tree Classifier
Naive Bayes
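Below is a minimal sketch of evaluating these four classifiers on the same test harness with k-fold cross-validation; X and y are assumed to exist from the earlier split sketch, and the hyperparameters shown are illustrative defaults, not the project's tuned values.

# Sketch: comparing the listed classifiers on a consistent cross-validation harness.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f}")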
ACCURACY: The proportion of the total number of predictions that are correct, i.e. how often the model correctly predicts defaulters and non-defaulters overall.
ACCURACY CALCULATION: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all passengers labeled as survived, how many actually survived? High precision relates to a low false positive rate. We have got 0.788 precision, which is pretty good.
F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost; if the costs of false positives and false negatives are very different, it is better to look at both Precision and Recall.
General Formula: F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
F1-Score Formula: F1 = 2TP / (2TP + FP + FN)
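These metrics can be computed directly with scikit-learn, as in the short sketch below; y_test and y_pred are assumed to exist from an earlier fit.

# Sketch: computing the metrics defined above with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))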
Sklearn:
In Python, sklearn is a machine learning package which includes a lot of ML algorithms. Here, we are using some of its modules like train_test_split, DecisionTreeClassifier or LogisticRegression, and accuracy_score.
NUMPY:
It is a numeric Python module which provides fast maths functions for calculations. It is used to read data into numpy arrays and for manipulation purposes.
PANDAS:
Used to read and write different files.
Data manipulation can be done easily with data frames.
MATPLOTLIB:
Data visualization is a useful way to help identify the patterns in a given dataset.
6.2.2 LOGISTIC REGRESSION
It is a statistical method for analysing a data set in which there are one or more
independent variables that determine an outcome. The outcome is measured with a
dichotomous variable (in which there are only two possible outcomes). The goal of
logistic regression is to find the best fitting model to describe the relationship between
the dichotomous characteristic of interest (dependent variable = response or outcome
variable) and a set of independent (predictor or explanatory) variables. Logistic
regression is a Machine Learning classification algorithm that is used to predict the
probability of a categorical dependent variable. In logistic regression, the dependent
variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0
(no, failure, etc.).
For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
The independent variables should be independent of each other; that is, the model should have little or no multicollinearity. A short scikit-learn sketch is given below.
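The following is a minimal sketch of fitting a logistic regression classifier with scikit-learn; X_train, X_test, y_train and y_test are assumed to come from the earlier split sketch, and the settings shown are illustrative, not the project's tuned configuration.

# Sketch: logistic regression on the credit card data (assumed variables).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Logistic regression accuracy:", accuracy_score(y_test, y_pred))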
MODULE DIAGRAM
6.2.3. RANDOM FOREST CLASSIFIER
MODULE DIAGRAM
It is one of the most powerful and popular algorithms. The decision-tree algorithm falls under the category of supervised learning algorithms and works for both continuous and categorical output variables. Assumptions of the decision tree:
This process is continued on the training set until a termination condition is met. The tree is constructed in a top-down recursive divide-and-conquer manner. All the attributes should be categorical; otherwise, they should be discretized in advance. Attributes at the top of the tree have more impact on the classification and are identified using the information gain concept. A decision tree can easily be over-fitted, generating too many branches, and may reflect anomalies due to noise or outliers.
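A minimal scikit-learn sketch of this classifier is shown below; the entropy criterion mirrors the information gain concept mentioned above, and the depth limit is an illustrative way to curb over-fitting rather than the project's actual setting.

# Sketch: decision tree classifier on the credit card data (assumed variables).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("Decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))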
MODULE DIAGRAM
6.2.5. NAIVE BAYES ALGORITHM:
The Naive Bayes algorithm is an intuitive method that uses the probabilities of
each attribute belonging to each class to make a prediction. It is the
supervised learning approach you would come up with if you wanted to model
a predictive modeling problem probabilistically.
Naive Bayes simplifies the calculation of probabilities by assuming that the
probability of each attribute belonging to a given class value is independent of
all other attributes. This is a strong assumption but results in a fast and
effective method.
The probability of a class value given a value of an attribute is called the
conditional probability. By multiplying the conditional probabilities together for
each attribute for a given class value, we have a probability of a data instance
belonging to that class. To make a prediction we can calculate probabilities of
the instance belonging to each class and select the class value with the
highest probability.
Naive Bayes is a statistical classification technique based on Bayes Theorem.
It is one of the simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and reliable algorithm. Naive Bayes classifiers have high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class
is independent of other features. For example, a loan applicant is desirable or
not depending on his/her income, previous loan and transaction history, age,
and location.
Even if these features are interdependent, they are still considered independently. This assumption simplifies computation, and that is why it is considered naive. This assumption is called class conditional independence. A short scikit-learn sketch is given below.
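The following minimal sketch uses scikit-learn's Gaussian Naive Bayes, a common choice for continuous features; the choice of GaussianNB and the variables X_train, X_test, y_train, y_test are assumptions for illustration.

# Sketch: Gaussian Naive Bayes on the credit card data (assumed variables).
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))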
MODULE DIAGRAM
GIVEN INPUT EXPECTED OUTPUT
input : data
output : getting accuracy
6.3 DEPLOYMENT
When Ronacher and Georg Brandl created a bulletin board system written in Python, the Pocoo projects Werkzeug and Jinja were developed.
In April 2016, the Pocoo team was disbanded and development of Flask and
related libraries passed to the newly formed Pallets project.
Flask has become popular among Python enthusiasts. As of October 2020, it has the second most stars on GitHub among Python web-development frameworks, only slightly behind Django, and it was voted the most popular web framework in the Python Developers Survey 2018.
MODULE DIAGRAM
GIVEN INPUT EXPECTED OUTPUT
input : data values
output : predicting output
6.3.2. FEATURES:
Flask was designed to be easy to use and extend. The idea behind Flask is
to build a solid foundation for web applications of different complexity. From then on
you are free to plug in any extensions you think you need. Also you are free to build
your own modules. Flask is great for all kinds of projects. It's especially good for
prototyping. Flask depends on two external libraries: the Jinja2 template engine and
the Werkzeug WSGI toolkit.
Still, the question remains: why use Flask as your web application framework when we already have the immensely powerful Django, Pyramid, and the web mega-framework TurboGears? Those are supreme Python web frameworks, but out-of-the-box Flask is pretty impressive too.
Flask also gives you much more control over the development stage of your project. It follows the principles of minimalism and lets you decide how you will build your application. Out of the box, Flask offers:
Flask has a lightweight and modular design, so it is easy to transform it into the web framework you need with just a few extensions, without weighing it down.
ORM-agnostic: you can plug in your favourite ORM, e.g. SQLAlchemy.
The basic foundation API is nicely shaped and coherent.
The Flask documentation is comprehensive, full of examples and well structured. You can even try out some sample applications to really get a feel for Flask.
It is super easy to deploy Flask in production (Flask is 100% WSGI 1.0 compliant).
HTTP request handling functionality.
High flexibility.
The configuration is even more flexible than that of Django, giving you plenty of solutions for every production need.
To sum up, Flask is one of the most polished and feature-rich micro frameworks available. Although still young, Flask has a thriving community, first-class extensions, and an elegant API. Flask comes with all the benefits of fast templates, strong WSGI features, thorough unit testability at the web application and library level, and extensive documentation. So next time you are starting a new project where you need some good features and a vast number of extensions, definitely check out Flask.
Overview of Python Flask Framework
Web apps are developed to generate content based on retrieved data that changes based on a user's interaction with the site. The server is responsible for querying, retrieving, and updating data. This makes web applications slower and more complicated to deploy than static websites for simple applications.
Flask is an excellent web development framework for REST API creation. It is built on top of Python, which makes it possible to use all of Python's features.
Flask is used for the backend, but it makes use of a templating language called Jinja2, which is used to create HTML, XML or other markup formats that are returned to the user in response to an HTTP request.
Flask is a web framework. This means flask provides you with tools, libraries
and technologies that allow you to build a web application. This web application can
be some web pages, a blog, a wiki or go as big as a web-based calendar application
or a commercial website.
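As a minimal sketch of such a web application (illustrative only; the project's actual deployment script is listed in the coding section), a Flask app can be as small as this:
from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    # a view function mapped to the root URL
    return 'Credit Card Fraud Detection'

if __name__ == '__main__':
    app.run(host='localhost', port=5000)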
Flask is a web framework for the Python language. Flask provides a library and a collection of code that can be used to build websites without the need to do everything from scratch. However, the Flask framework does not itself use the Model View Controller (MVC) method.
Flask-RESTful is an extension for Flask that adds support for building REST APIs in Python using Flask as the back end. It encourages best practices and is very easy to set up. Flask-RESTful is very easy to pick up if you're already familiar with Flask.
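A minimal Flask-RESTful sketch is shown below; it assumes the flask_restful extension is installed, and the resource name and route are invented for this example.
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class Status(Resource):
    def get(self):
        # Flask-RESTful converts the returned dict into a JSON response
        return {'service': 'credit-card-fraud-detection', 'status': 'ok'}

api.add_resource(Status, '/status')

if __name__ == '__main__':
    app.run()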
Flask is a web framework for Python, meaning that it provides functionality for building web applications, including managing HTTP requests and rendering templates, and we can extend such an application to create our own API.
The flask object implements a WSGI application and acts as the central object. It is passed the name of the module or package of the application. Once it is created, it will act as a central registry for the view functions, the URL rules, template configuration and much more.
The name of the package is used to resolve resources from inside the package or the folder the module is contained in, depending on whether the package parameter resolves to an actual Python package (a folder with an __init__.py file inside) or a standard module (just a .py file).
after_request(f)
The function is called with the response object, and must return a
response object. This allows the functions to modify or replace the response
before it is sent.
Parameters:
f (Callable[[Response], Response])
Return type
Callable[[Response], Response]
after_request_funcs: t.Dict[AppOrBlueprintKey,
t.List[AfterRequestCallable]]
This data structure is internal. It should not be modified directly and its
format may change at any time.
app_context()
An application context is automatically pushed by RequestContext.push() when handling a request, and when running a CLI command. Use this to manually create a context outside of these situations, for example:
with app.app_context():
    init_db()
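A short illustrative sketch of the two APIs described above follows; init_db and the response header used here are hypothetical placeholders, not part of the project code.
from flask import Flask

app = Flask(__name__)

def init_db():
    pass   # placeholder for whatever one-time setup the application needs

@app.after_request
def add_header(response):
    # called with the response object; must return a response object
    response.headers['X-App'] = 'fraud-detection'
    return response

# manually push an application context outside of a request, e.g. in a setup script
with app.app_context():
    init_db()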
6.4 HTML
HTML stands for Hyper Text Markup Language. It is used to design web
pages using a markup language. HTML is the combination of Hypertext and Markup language. Hypertext defines the links between web pages. A markup language is used to define the text document within tags, which define the structure of web pages. This language is used to annotate (make notes for the computer) text so that a machine can understand it and manipulate the text accordingly. Most markup
languages (e.g. HTML) are human-readable. The language uses tags to define what
manipulation has to be done on the text.
<!DOCTYPE html> — This tag specifies the language you will write on the page. In
this case, the language is HTML 5.
<html> — This tag signals that from here on we are going to write in HTML code.
<head> — This is where all the metadata for the page goes — stuff mostly meant for
search engines and other computer programs.
Further Tags
Inside the <head> tag, there is one tag that is always included: <title>, but there are
others that are just as important:
<title>
This is where we insert the page name as it will appear at the top of the
browser window or tab.
<meta>
This is where information about the document is stored: character
encoding, name (page context), description.
Head Tag
<head>
<title>My First Webpage</title>
<meta charset="UTF-8">
<meta name="description" content="This field contains information about your page. It is usually around two sentences long.">
<meta name="author" content="Conor Sheils">
</head>
Adding Content
Next, we will add the <body> tag.
The HTML <body> is where we add the content which is designed for viewing by
human eyes.
This includes text, images, tables, forms and everything else that we see on the
internet each day.
<h1>
<h2>
<h3>
<h4>
<h5>
<h6>
As you might have guessed <h1> and <h2> should be used for the most
important titles, while the remaining tags should be used for sub-headings and less
important text.
Search engine bots use this order when deciphering which information is most
important on a page.
And hit save. We will save this file as "index.html" in a new folder called "my webpage".
Adding text to our HTML page is simple using an element opened with the tag
<p> which creates a new paragraph. We place all of our regular text inside the
element <p>.
Almost everything you click on while surfing the web is a link that takes you to another page within the website you are visiting or to an external site.
Links are created with the <a> tag and its href attribute. This element is the first that we've met which uses an attribute, and so it looks different to the previously mentioned tags.
<a href="http://www.google.com">Google</a>
IMAGE TAG
In today's modern digital world, images are everything. The <img> tag has everything you need to display images on your site. Much like the <a> anchor element, <img> also contains an attribute.
6.5 CSS
CSS stands for Cascading Style Sheets. It is the language for describing the presentation of Web pages, including colours, layout, and fonts, thus making our web pages presentable to the users. CSS is designed to make style sheets for the web. It is independent of HTML and can be used with any XML-based markup language.
CSS Syntax
selector {
property1: value;
property2: value;
property3: value;
}
For example:
h1 {
color: red;
text-align: center;
}
#unique {
color: green;
}
CSS Comment
CSS How-To
Priority order
o Inline > Internal > External
INLINE CSS
INTERNAL CSS
With the help of style tag, we can apply styles within the HTML file
Redundancy is removed
But the idea of separation of concerns is still lost
Applied only to a single document
Example:
<style>
h1 { color: red; }
</style>
EXTERNAL CSS
With the help of <link> tag in the head tag, we can apply styles
Reference is added
File saved with .css extension
Redundancy is removed
The idea of separation of concerns is maintained
Uniquely applied to each document
Example:
<head>
<link rel="stylesheet" type="text/css" href="name of the CSS file">
</head>
And inside the linked .css file:
h1 { color: red; }
CSS Selectors
Priority of Selectors
CSS Colors
CSS BACKGROUND
There are different ways by which CSS can have an effect on HTML
elements
Few of them are as follows:
o Color – used to set the color of the background
o Repeat – used to determine if the image has to repeat or not
and if it is repeating then how it should do that
o Image – used to set an image as the background
o Position – used to determine the position of the image
o Attachment – It basically helps in controlling the mechanism of
scrolling.
CSS Box Model
7.CODING
MODULE – 1
PRE-PROCESSING
In [ ]:
#import library packages
import pandas as p
import numpy as n
import matplotlib.pyplot as plt
import seaborn as s
In [ ]:
data = p.read_csv('creditcard.csv')
In [ ]:
import warnings
warnings.filterwarnings('ignore')
Before dropping columns from the given dataset
In [ ]:
data.head()
In [ ]:
#shape
data.shape
In [ ]:
data.columns
After dropping columns from the given dataset
In [ ]:
del data['TransactionDate']
In [ ]:
df=data.dropna()
In [ ]:
df.shape
In [ ]:
df.describe()
In [ ]:
del df['Merchant_id']
In [ ]:
df.columns
In [ ]:
df.info()
Checking for duplicate values in the dataframe
In [ ]:
#Checking for duplicate data
df.duplicated()
In [ ]:
#find sum of duplicate data
sum(df.duplicated())
In [ ]:
#Checking sum of missing values
df.isnull().sum()
In [ ]:
df.isForeignTransaction.unique()
In [ ]:
df.TransactionAmount.unique()
In [ ]:
df.isHighRiskCountry.unique()
In [ ]:
df.DailyChargebackAvgAmt.unique()
In [ ]:
p.Categorical(df['isFradulent']).describe()
In [ ]:
p.Categorical(df['6_MonthAvgChbkAmt']).describe()
In [ ]:
p.Categorical(df['AverageAmountTransactionDay']).describe()
In [ ]:
df.columns
In [ ]:
df['6_MonthChbkFreq'].value_counts()
In [ ]:
df['6_MonthAvgChbkAmt'].value_counts()
In [ ]:
df.corr()
After Pre-processing
In [ ]:
df.head()
In [ ]:
df.columns
In [ ]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['AverageAmountTransactionDay', 'TransactionAmount', 'Is_declined',
'TotalNumberOfDeclinesDay', 'isForeignTransaction', 'isHighRiskCountry',
'DailyChargebackAvgAmt', '6_MonthAvgChbkAmt', '6_MonthChbkFreq',
'isFradulent']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
In [ ]:
df.head(10)
In [ ]:
df.isnull().sum()
In [ ]:
df.tail(10)
MODULE – 2
VISUALIZATION
# PropByVar is a helper function defined in a notebook cell not shown in this
# listing; it summarizes the class proportions of the 'isFradulent' column.
PropByVar(df, 'isFradulent')
In [ ]:
# Heatmap plot diagram
fig, ax = plt.subplots(figsize=(15,7))
s.heatmap(df.corr(), ax=ax, annot=True)
In [ ]:
plt.boxplot(df['AverageAmountTransactionDay'])
plt.show()
In [ ]:
import seaborn as s
s.boxplot(df['AverageAmountTransactionDay'], color='m')
In [ ]:
from sklearn.preprocessing import LabelEncoder
var_mod =['AverageAmountTransactionDay', 'TransactionAmount', 'Is_declined',
'TotalNumberOfDeclinesDay', 'isForeignTransaction', 'isHighRiskCountry',
'DailyChargebackAvgAmt', '6_MonthAvgChbkAmt', '6_MonthChbkFreq',
'isFradulent']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
In [ ]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df['AverageAmountTransactionDay'],df['DailyChargebackAvgAmt'])
ax.set_xlabel('AverageAmountTransactionDay')
ax.set_ylabel('DailyChargebackAvgAmt')
ax.set_title('Daily Transaction & Chargeback Amount')
plt.show()
In [ ]:
df.columns
In [ ]:
plt.plot(df["TransactionAmount"], df["DailyChargebackAvgAmt"], color='g')
plt.xlabel('TransactionAmount')
plt.ylabel('DailyChargebackAvgAmt')
plt.title('Credit Card Transaction')
plt.show()
Splitting Train / Test
In [ ]:
#preprocessing, split test and dataset, split response variable
X = df.drop(labels='isFradulent', axis=1)
#Response variable
y = df.loc[:,'isFradulent']
In [ ]:
#We'll use a test size of 20%. We also stratify the split on the response variable,
#which is very important to do because there are so few fraudulent transactions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
random_state=1, stratify=y)
print("Number of training dataset: ", len(X_train))
print("Number of test dataset: ", len(X_test))
print("Total number of dataset: ", len(X_train)+len(X_test))
In [ ]:
def qul_No_qul_bar_plot(df, bygroup):
    dataframe_by_Group = p.crosstab(df[bygroup], columns=df["isFradulent"], normalize='index')
    dataframe_by_Group = n.round((dataframe_by_Group * 100), decimals=2)
    ax = dataframe_by_Group.plot.bar(figsize=(15,7));
    vals = ax.get_yticks()
    ax.set_yticklabels(['{:3.0f}%'.format(x) for x in vals]);
    ax.set_xticklabels(dataframe_by_Group.index, rotation=0, fontsize=15);
    ax.set_title('Credit Card Transaction (%) (by ' + dataframe_by_Group.index.name + ')\n', fontsize=15)
    ax.set_xlabel(dataframe_by_Group.index.name, fontsize=12)
    ax.set_ylabel('(%)', fontsize=12)
    ax.legend(loc='upper left', bbox_to_anchor=(1.0,1.0), fontsize=12)
    rects = ax.patches
MODULE – 3
LOGISTIC REGRESSION
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state=1, stratify=y)
print("Number of training dataset: ", len(X_train))
print("Number of test dataset: ", len(X_test))
print("Total number of dataset: ", len(X_train)+len(X_test))
In [ ]:
#According to the cross-validated MCC scores, the random forest is the best-
#performing model, so now let's evaluate its performance on the test set.
from sklearn.metrics import (confusion_matrix, classification_report,
    matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score,
    roc_auc_score)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logR= LogisticRegression()
logR.fit(X_train,y_train)
predictLR = logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test,predictLR))
print("")
cm1=confusion_matrix(y_test,predictLR)
print('Confusion Matrix result of Logistic Regression is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
data=[LR]   # LR holds the Logistic Regression accuracy score computed in a cell not shown in this listing
alg="Logistic Regression"
plt.figure(figsize=(5,5))
b=plt.bar(alg,data,color=("c"))
plt.title("Accuracy comparison of Credit Card Fraud Detection",fontsize=15)
plt.legend(b,data,fontsize=9)
In [ ]:
graph()   # graph() is a plotting helper defined in an earlier notebook cell not shown in this listing
In [ ]:
# Confusion-matrix cells, treating the first class (row 0) as the positive class,
# consistent with the sensitivity and specificity computed above.
TP = cm1[0][0]
FN = cm1[0][1]
FP = cm1[1][0]
TN = cm1[1][1]
print("True Positive :",TP)
print("True Negative :",TN)
print("False Positive :",FP)
print("False Negative :",FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :",TPR)
print("True Negative Rate :",TNR)
print("False Positive Rate :",FPR)
print("False Negative Rate :",FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
print("Positive Predictive Value :",PPV)
print("Negative predictive value :",NPV)
In [ ]:
def plot_confusion_matrix(cm1, title='Confusion matrix-Logistic_Regression', cmap=plt.cm.Blues):
    target_names=['Predict','Actual']
    plt.imshow(cm1, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = n.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
cm1=confusion_matrix(y_test, predictLR)
print('Confusion matrix-Logistic_Regression:')
print(cm1)
plot_confusion_matrix(cm1)
In [ ]:
MODULE – 4
RANDOM FOREST CLASSIFIER
In [ ]:
#import library packages
import pandas as p
import matplotlib.pyplot as plt
import seaborn as s
import numpy as n
In [ ]:
import warnings
warnings.filterwarnings('ignore')
In [ ]:
data=p.read_csv('creditcard.csv')
In [ ]:
del data['Merchant_id']
del data['TransactionDate']
In [ ]:
df=data.dropna()
In [ ]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Is_declined','isForeignTransaction', 'isHighRiskCountry','isFradulent']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
In [ ]:
#preprocessing, split test and dataset, split response variable
X = df.drop(labels='isFradulent', axis=1)
#Response variable
y = df.loc[:,'isFradulent']
In [ ]:
'''We'll use a test size of 30%. We also stratify the split on the response variable,
which is very important to do because there are so few fraudulent transactions'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state=1, stratify=y)
print("Number of training dataset: ", len(X_train))
print("Number of test dataset: ", len(X_test))
print("Total number of dataset: ", len(X_train)+len(X_test))
In [ ]:
#According to the cross-validated MCC scores, the random forest is the best-
#performing model, so now let's evaluate its performance on the test set.
from sklearn.metrics import (confusion_matrix, classification_report,
    matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score,
    roc_auc_score)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rfc= RandomForestClassifier()
rfc.fit(X_train,y_train)
predictRF = rfc.predict(X_test)
print("")
print('Classification report of Random Forest Results:')
print("")
print(classification_report(y_test,predictRF))
print("")
cm1=confusion_matrix(y_test,predictRF)
print('Confusion Matrix result of Random Forest Classifier is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
data=[RF]   # RF holds the Random Forest accuracy score computed in a cell not shown in this listing
alg="Random Forest Classifier"
plt.figure(figsize=(5,5))
b=plt.bar(alg,data,color=("r"))
plt.title("Accuracy comparison of Credit Card Fraud Detection",fontsize=15)
plt.legend(b,data,fontsize=9)
In [ ]:
graph()
In [ ]:
# Confusion-matrix cells, treating the first class (row 0) as the positive class,
# consistent with the sensitivity and specificity computed above.
TP = cm1[0][0]
FN = cm1[0][1]
FP = cm1[1][0]
TN = cm1[1][1]
print("True Positive :",TP)
print("True Negative :",TN)
print("False Positive :",FP)
print("False Negative :",FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :",TPR)
print("True Negative Rate :",TNR)
print("False Positive Rate :",FPR)
print("False Negative Rate :",FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
print("Positive Predictive Value :",PPV)
print("Negative predictive value :",NPV)
In [ ]:
def plot_confusion_matrix(cm1, title='Confusion matrix-RandomForestClassifier', cmap=plt.cm.Blues):
    target_names=['Predict','Actual']
    plt.imshow(cm1, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = n.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
cm1=confusion_matrix(y_test, predictRF)
print('Confusion matrix-RandomForestClassifier:')
print(cm1)
plot_confusion_matrix(cm1)
MODULE – 5
DECISION TREE ALGORITHM
import pandas as p
import numpy as n
import matplotlib.pyplot as plt
import seaborn as s
In [ ]:
import warnings
warnings.filterwarnings('ignore')
In [ ]:
data=p.read_csv('creditcard.csv')
In [ ]:
del data['Merchant_id']
del data['TransactionDate']
In [ ]:
df=data.dropna()
In [ ]:
df.columns
In [ ]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Is_declined','isForeignTransaction', 'isHighRiskCountry', 'isFradulent']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
In [ ]:
#preprocessing, split test and dataset, split response variable
X = df.drop(labels='isFradulent', axis=1)
#Response variable
y = df.loc[:,'isFradulent']
In [ ]:
'''We'll use a test size of 30%. We also stratify the split on the response variable,
which is very important to do because there are so few fraudulent transactions'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
print("Number of training dataset: ", len(X_train))
print("Number of test dataset: ", len(X_test))
print("Total number of dataset: ", len(X_train)+len(X_test))
In [ ]:
#According to the cross-validated MCC scores, the random forest is the best-
#performing model, so now let's evaluate its performance on the test set.
from sklearn.metrics import (confusion_matrix, classification_report,
    matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score,
    roc_auc_score)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
dtree= DecisionTreeClassifier()
dtree.fit(X_train,y_train)
predictDT = dtree.predict(X_test)
print("")
print('Classification report of Decision Tree Results:')
print("")
print(classification_report(y_test,predictDT))
print("")
cm1=confusion_matrix(y_test,predictDT)
print('Confusion Matrix result of Decision Tree Classifier is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
plt.title("Accuracy comparison of Credit Card Fraud Detection",fontsize=15)
plt.legend(b,data,fontsize=9)
In [ ]:
graph()
In [ ]:
# Confusion-matrix cells, treating the first class (row 0) as the positive class,
# consistent with the sensitivity and specificity computed above.
TP = cm1[0][0]
FN = cm1[0][1]
FP = cm1[1][0]
TN = cm1[1][1]
print("True Positive :",TP)
print("True Negative :",TN)
print("False Positive :",FP)
print("False Negative :",FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :",TPR)
print("True Negative Rate :",TNR)
print("False Positive Rate :",FPR)
print("False Negative Rate :",FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
print("Positive Predictive Value :",PPV)
print("Negative predictive value :",NPV)
In [ ]:
def plot_confusion_matrix(cm1, title='Confusion matrix-DecisionTreeClassifier', cmap=plt.cm.Blues):
    target_names=['Predict','Actual']
    plt.imshow(cm1, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = n.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
cm1=confusion_matrix(y_test, predictDT)
print('Confusion matrix-DecisionTreeClassifier:')
print(cm1)
plot_confusion_matrix(cm1)
MODULE – 6
NAIVE BAYES ALGORITHM
print("Number of test dataset: ", len(X_test))
print("Total number of dataset: ", len(X_train)+len(X_test))
In [ ]:
#According to the cross-validated MCC scores, the random forest is the best-
#performing model, so now let's evaluate its performance on the test set.
from sklearn.metrics import (confusion_matrix, classification_report,
    matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score,
    roc_auc_score)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
gnb = GaussianNB()
gnb.fit(X_train,y_train)
predictNB = gnb.predict(X_test)
print("")
print('Classification report of Naive Bayes Results:')
print("")
print(classification_report(y_test,predictNB))
print("")
cm1=confusion_matrix(y_test,predictNB)
print('Confusion Matrix result of Naive Bayes is:\n',cm1)
print("")
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
print("")
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
print("")
plt.title("Accuracy comparison of Credit Card Fraud Detection",fontsize=15)
plt.legend(b,data,fontsize=9)
In [ ]:
graph()
In [ ]:
# Confusion-matrix cells, treating the first class (row 0) as the positive class,
# consistent with the sensitivity and specificity computed above.
TP = cm1[0][0]
FN = cm1[0][1]
FP = cm1[1][0]
TN = cm1[1][1]
print("True Positive :",TP)
print("True Negative :",TN)
print("False Positive :",FP)
print("False Negative :",FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :",TPR)
print("True Negative Rate :",TNR)
print("False Positive Rate :",FPR)
print("False Negative Rate :",FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
print("Positive Predictive Value :",PPV)
print("Negative predictive value :",NPV)
In [ ]:
def plot_confusion_matrix(cm1, title='Confusion matrix-Naive Bayes', cmap=plt.cm.Blues):
    target_names=['Predict','Actual']
    plt.imshow(cm1, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = n.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
cm1=confusion_matrix(y_test, predictNB)
print('Confusion matrix-Naive Bayes:')
print(cm1)
plot_confusion_matrix(cm1)
HTML CODE:
<!DOCTYPE html>
<html >
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
<meta charset="UTF-8">
<title>TITLE</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/bootstrap.min.css') }}">
<link href='https://fonts.googleapis.com/css?family=Pacifico' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Arimo' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Hind:300' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300'
rel='stylesheet' type='text/css'>
<style>
.back{
background-image: url("{{ url_for('static', filename='image/card.gif') }}");
background-repeat:no-repeat;
background-size:cover;
}
.white{
color:white;
}
.nspace{
margin:15px 15px 30px 30px;
padding:9px 10px;
background: palegreen;
width:500px
}
.space{
margin:10px 30px;
padding:10px 10px;
background: palegreen;
width:500px
}
.gap{
padding:10px 20px;
}
</style>
</head>
<body >
<div>
<div class="jumbotron">
<h1 style="text-align:center"> CREDIT CARD FRAUD DETECTION </h1>
</div>
<div class="back">
<!-- Main Input For Receiving Query to our ML -->
<form class="form-group" action="{{ url_for('predict') }}" method="post">
<div class="row">
<div class="gap col-md-6 ">
<label class="white" for="">AVERAGE AMOUNT TRANSACTION /
DAY</label>
<input type="number" class="space form-control" step="0.01"
name="AVERAGE AMOUNT TRANSACTION / DAY" placeholder="AVERAGE
AMOUNT TRANSACTION / DAY" required="required" /><br>
<select class="nspace form-control" name="IS DECLINED" id="IS
DECLINED">
<option value=0>NO</option>
<option value=1>YES</option>
</select>
</div>
<label class="white" for="">DAILY CHARGE BACK AVERAGE
AMOUNT</label>
<input type="number" class="space form-control" step="0.01"
name="DAILY CHARGE BACK AVERAGE AMOUNT" placeholder="DAILY
CHARGE BACK AVERAGE AMOUNT" required="required" /><br>
</div>
</div>
</form>
</div>
<br>
<br>
</body>
</html>
FLASK DEPLOY:
import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle
import joblib
app = Flask(__name__)
model = joblib.load('lr.pkl')
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    # The form fields are numeric, so convert the submitted strings to floats
    # before building the feature vector for the model.
    int_features = [float(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    print(final_features)
    prediction = model.predict(final_features)
    output = prediction[0]
    if output == 1:
        output = 'Fraudulent'
    else:
        output = "Not Fraudulent"
    # 'prediction_text' is an assumed template variable name; index.html would
    # need a corresponding placeholder to display the result.
    return render_template('index.html', prediction_text='Transaction is {}'.format(output))

if __name__ == "__main__":
    app.run(host="localhost", port=5000)
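The deployment above loads a serialized model from 'lr.pkl'. That file is not created anywhere in this listing; a minimal sketch of how it could be produced from the logistic regression model trained in Module 3 is:
import joblib
from sklearn.linear_model import LogisticRegression

logR = LogisticRegression()
logR.fit(X_train, y_train)      # X_train, y_train from the Module 3 train/test split
joblib.dump(logR, 'lr.pkl')     # the Flask app later restores it with joblib.load('lr.pkl')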
8. CONCLUSION
The analytical process started from data cleaning and processing, handling of missing values and exploratory analysis, and finally model building and evaluation. The model that achieves the highest accuracy score on the test set is identified as the best one. This application can then be used to detect fraudulent credit card transactions.
9. REFERENCES: