Identifying Software Bugs or Not Using the SMLT Model
Abstract:
A software bug is an error, flaw or fault in a computer
program or system that causes it to produce an incorrect or unexpected result,
or to behave in unintended ways. Most bugs arise from mistakes made in a
program's design or source code, or in components and operating systems used
by such programs; a few are caused by compilers producing incorrect code. A
program that contains many bugs, or bugs that seriously interfere with its
functionality, is said to be buggy. Bugs usually appear when the programmer
makes a logic error. In this work, the dataset is analysed with supervised
machine learning techniques (SMLT) to capture several kinds of information:
variable identification, uni-variate, bi-variate and multi-variate analysis,
and missing-value treatment; data validation, data cleaning/preparation and
data visualization are performed on the entire given dataset. We then propose
a machine learning-based method that classifies whether a module contains a
software bug or not, selecting the best model by comparing the accuracy of
supervised classification machine learning algorithms.
Existing System:
Insights are generated from the feature importance ranks that are computed
by either CS or CA methods. However, the choice between the CS and CA
methods to derive those insights remains arbitrary, even for the same
classifier. In addition, the choice of the exact feature importance method is
seldom justified. In other words, several prior studies use feature importance
methods interchangeably without any specific rationale, even though
different methods compute the feature importance ranks differently.
Therefore, in this study, we set out to estimate the extent to which feature
importance ranks computed by CS and CA methods differ.
Drawbacks:
• No machine learning method is used.
• Prediction accuracy is not reported.
INTRODUCTION
Domain overview
Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and
systems to extract knowledge and insights from structured and unstructured data, and apply
knowledge and actionable insights from data across a broad range of application domains.
The term "data science" has been traced back to 1974, when Peter Naur proposed it as an
alternative name for computer science. In 1996, the International Federation of Classification Societies
became the first conference to specifically feature data science as a topic. However, the definition was
still in flux.
Data Scientist:
Data scientists examine which questions need answering and where to find the related data. They
have business acumen and analytical skills as well as the ability to mine, clean, and present data.
Businesses use data scientists to source, manage, and analyze large amounts of unstructured data.
ARTIFICIAL INTELLIGENCE
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to any
machine that exhibits traits associated with a human mind such as learning and problem-solving.
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural
intelligence displayed by humans or animals. Leading AI textbooks define the field as the study of
"intelligent agents": any system that perceives its environment and takes actions that maximize its
chance of achieving its goals. Some popular accounts use the term "artificial intelligence" to describe
machines that mimic "cognitive" functions that humans associate with the human mind, such as
"learning" and "problem solving"; however, this definition is rejected by major AI researchers.
Natural Language Processing (NLP):
Natural language processing (NLP) allows machines to read and understand human language. A
sufficiently powerful natural language processing system would enable natural-language user
interfaces and the acquisition of knowledge directly from human-written sources, such as newswire
texts. Some straightforward applications of natural language processing include information
retrieval, text mining, question answering and machine translation.
MACHINE LEARNING
The goal of machine learning is to predict the future from past data. Machine learning (ML) is a
type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly
programmed. It focuses on the development of computer programs that can change when exposed to
new data; this project covers the basics of machine learning and the implementation of a simple
machine learning algorithm using Python. The process of training and prediction involves the use of
specialized algorithms: we feed training data to an algorithm, and the algorithm uses this training data
to make predictions on new test data. Machine learning can be roughly separated into three categories:
supervised learning, unsupervised learning and reinforcement learning.
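The train-then-predict flow described above can be sketched with scikit-learn. The tiny dataset here is synthetic and purely illustrative (two made-up metric columns per module), not the project's real data:

```python
# Minimal supervised-learning sketch: learn from past (training) data,
# then predict labels for new, unseen test data.
from sklearn.tree import DecisionTreeClassifier

# Feature rows (e.g. code metrics) with known labels
X_train = [[10, 1], [200, 12], [15, 2], [350, 20]]
y_train = [0, 1, 0, 1]  # 0 = clean module, 1 = buggy module

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)        # training phase

X_test = [[12, 1], [300, 18]]    # new, unseen modules
print(clf.predict(X_test))       # prediction phase
```

Any supervised classifier in scikit-learn exposes this same `fit`/`predict` interface, which is what makes comparing algorithms on one dataset straightforward.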
Preparing the Dataset
Attribute Information:
• 1. loc : numeric % McCabe's line count of code
• 2. v(g) : numeric % McCabe "cyclomatic complexity"
• 3. ev(g) : numeric % McCabe "essential complexity"
• 4. iv(g) : numeric % McCabe "design complexity"
• 5. n : numeric % Halstead total operators + operands
• 6. v : numeric % Halstead "volume"
• 7. l : numeric % Halstead "program length"
• 8. d : numeric % Halstead "difficulty"
• 9. i : numeric % Halstead "intelligence"
• 10. e : numeric % Halstead "effort"
• 11. b : numeric % Halstead "delivered bugs" estimate
• 12. t : numeric % Halstead's time estimator
• 13. lOCode : numeric % Halstead's line count
• 14. lOComment : numeric % Halstead's count of lines of comments
• 15. lOBlank : numeric % Halstead's count of blank lines
• 16. lOCodeAndComment: numeric
• 17. uniq_Op : numeric % unique operators
• 18. uniq_Opnd : numeric % unique operands
• 19. total_Op : numeric % total operators
• 20. total_Opnd : numeric % total operands
• 21. branchCount : numeric % branch count of the flow graph
• 22. defects : {false,true} % module has/has not one or more reported defects
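A sketch of loading these attributes with pandas. The two data rows below are invented for illustration; in practice the real CSV would be read instead (the commented file name is an assumption, not part of this report's dataset description):

```python
import pandas as pd

# Attribute names taken from the list above; "defects" is the target label.
cols = ["loc", "v(g)", "ev(g)", "iv(g)", "n", "v", "l", "d", "i", "e", "b",
        "t", "lOCode", "lOComment", "lOBlank", "lOCodeAndComment",
        "uniq_Op", "uniq_Opnd", "total_Op", "total_Opnd",
        "branchCount", "defects"]

# Two illustrative rows; with a real export you would instead do e.g.
#   df = pd.read_csv("defects.csv", names=cols)   # file name assumed
rows = [[11, 1, 1, 1, 20, 80.0, 0.10, 5.0, 16.0, 400.0, 0.03,
         22.0, 9, 1, 1, 0, 8, 6, 12, 8, 1, False],
        [310, 24, 9, 16, 900, 6000.0, 0.02, 40.0, 150.0, 240000.0, 2.0,
         13000.0, 280, 30, 20, 5, 30, 90, 500, 400, 47, True]]

df = pd.DataFrame(rows, columns=cols)
X = df.drop(columns=["defects"])   # the 21 numeric predictors
y = df["defects"].astype(int)      # boolean label -> 0/1
print(X.shape, list(y))
```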
Proposed System:
Exploratory Data Analysis:
• Supervised classification machine learning algorithms will be applied
to the given dataset to extract patterns, which would help in
classifying the modules and thereby support better decisions about the
software in the future.
Data Wrangling:
• In this section of the report, the data is loaded, checked for
cleanliness, and then trimmed and cleaned for analysis.
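The wrangling step can be sketched with pandas. The toy frame below deliberately contains a duplicate row and a missing value, the two problems this step removes; both are synthetic examples, not rows from the project dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of problems data wrangling fixes
df = pd.DataFrame({"loc": [10, 10, 250, np.nan],
                   "v(g)": [1, 1, 14, 3],
                   "defects": [False, False, True, False]})

df = df.drop_duplicates()                          # trim exact duplicate rows
df["loc"] = df["loc"].fillna(df["loc"].median())   # impute missing values
print(df.shape, df["loc"].isnull().sum())
```

Median imputation is one common choice; dropping incomplete rows with `dropna()` is the simpler alternative when the dataset is large enough.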
Data collection:
• The dataset collected for classification is split into a training set
and a test set. Generally, a 70:30 split is applied between the
training set and the test set. The data model built using SMLT is
applied to the training set and, based on the resulting accuracy,
predictions are made on the test set.
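The 70:30 split described above maps directly onto scikit-learn's `train_test_split`. The feature and label arrays here are synthetic stand-ins for the real metric dataset:

```python
from sklearn.model_selection import train_test_split

# Ten synthetic samples, balanced between the two classes
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# test_size=0.30 gives the 70:30 split; stratify keeps the class
# balance the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
print(len(X_train), len(X_test))
```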
Advantages:
• A machine learning method is implemented.
• The data is pre-processed and analysed.
• Performance metrics of different algorithms are compared and the
better prediction is chosen.
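Comparing classifiers on a common held-out set, as the last advantage describes, can be sketched as follows. scikit-learn's built-in breast-cancer dataset is used here only as a stand-in for the defect dataset, and the three algorithms are illustrative choices, not the report's fixed selection:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

models = {"logreg": LogisticRegression(max_iter=5000),
          "tree": DecisionTreeClassifier(random_state=1),
          "forest": RandomForestClassifier(random_state=1)}

# Fit each model on the training set and score it on the same test set
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Accuracy is only one metric; precision, recall and F1 matter too when the defective class is rare, as it often is in defect datasets.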
Literature survey:
General
A literature review is a body of text that aims to review the critical points of current knowledge on,
and/or methodological approaches to, a particular topic. It draws on secondary sources and discusses
published information in a particular subject area, sometimes restricted to a certain time period. Its
ultimate goal is to bring the reader up to date with current literature on a topic; it forms the basis for
another goal, such as future research that may be needed in the area, precedes a research proposal, and
may be just a simple summary of sources. Usually, it has an organizational pattern and combines both
summary and synthesis.
A summary is a recap of important information about the source, whereas a synthesis is a re-
organization, a reshuffling, of that information. It might give a new interpretation of old material,
combine new with old interpretations, or trace the intellectual progression of the field, including major
debates. Depending on the situation, the literature review may evaluate the sources and advise the
reader on the most pertinent or relevant of them.
Title : A Systematic Literature Review of Software Defect Prediction: Research Trends, Datasets,
Methods and Frameworks
Author: Romi Satria Wahono
Year : 2015
Recent studies of software defect prediction typically produce datasets, methods and frameworks
which allow software engineers to focus on development activities in terms of defect-prone code, thereby
improving software quality and making better use of resources. Many software defect prediction datasets,
methods and frameworks are published in disparate and complex forms, so a comprehensive picture of the
current state of defect prediction research is missing. This literature review aims to identify and analyze
the research trends, datasets, methods and frameworks used in software defect prediction research
between 2000 and 2013. Of the studies surveyed, 77.46% are related to classification methods, 14.08%
focused on estimation methods, and 1.41% concerned clustering and association
methods. In addition, 64.79% of the research studies used public datasets and 35.21% of the research studies
used private datasets. Nineteen different methods have been applied to predict software defects. From the
nineteen methods, seven most applied methods in software defect prediction are identified. Researchers
proposed some techniques for improving the accuracy of machine learning classifier for software defect
prediction by ensembling some machine learning methods, by using boosting algorithm, by adding feature
selection and by using parameter optimization for some classifiers. The results of this research also
identified three frameworks that are highly cited and therefore influential in the software defect prediction
field. They are Menzies et al. Framework, Lessmann et al. Framework, and Song et al. Framework.
Title : Anomaly-Based Bug Prediction, Isolation, and Validation: An Automated Approach for
Software Debugging
Author: Martin Dimitrov and Huiyang Zhou
Year : 2009
Software defects, commonly known as bugs, present a serious challenge for system reliability and
dependability. Once a program failure is observed, the debugging activities to locate the defects are
typically nontrivial and time consuming. In this paper, we propose a novel automated approach to
pinpoint the root causes of software failures. Our proposed approach consists of three steps. The first step is
bug prediction, which leverages the existing work on anomaly-based bug detection as exceptional
behavior during program execution has been shown to frequently point to the root cause of a software
failure. The second step is bug isolation, which eliminates false-positive bug predictions by checking
whether the dynamic forward slices of bug predictions lead to the observed program failure. The last step
is bug validation, in which the isolated anomalies are validated by dynamically nullifying their effects
and observing if the program still fails. The whole bug prediction, isolation and validation process is fully
automated and can be implemented with efficient architectural support. Our experiments with 6
programs and 7 bugs, including a real bug in the gcc 2.95.2 compiler, show that our approach is highly
effective at isolating only the relevant anomalies. Compared to state-of-the-art debugging techniques, our
proposed approach pinpoints the defect locations more accurately and presents the user with a much
smaller code set to analyze. Categories and Subject Descriptors C.0 [Computer Systems Organization]:
Hardware/Software interfaces; D.2.5 [Software Engineering]: Testing and Debugging – debugging aids.
Title : An Empirical Study on the Use of Defect Prediction for Test Case Prioritization
Author: David Paterson, Jose Campos, Rui Abreu
Year : 2019
Test case prioritization has been extensively researched as a means for reducing the time taken to
discover regressions in software. While many different strategies have been developed and
evaluated, prior experiments have shown them to not be effective at prioritizing test suites to find
real faults. This paper presents a test case prioritization strategy based on defect prediction, a
technique that analyzes code features – such as the number of revisions and authors — to estimate
the likelihood that any given Java class will contain a bug. Intuitively, if defect prediction can
accurately predict the class that is most likely to be buggy, a tool can prioritize tests to rapidly detect
the defects in that class. We investigated how to configure a defect prediction tool, called Schwa, to
maximize the likelihood of an accurate prediction, surfacing the link between perfect defect
prediction and test case prioritization effectiveness. Using 6 real-world Java programs containing 395
real faults, we conducted an empirical evaluation comparing this paper’s strategy, called G-clef,
against eight existing test case prioritization strategies. The experiments reveal that using defect
prediction to prioritize test cases reduces the number of test cases required to find a fault by on
average 9.48% when compared with existing coverage-based strategies, and 10.5% when compared
with existing history-based strategies.
Project Goals
• Exploratory data analysis for variable identification
• Loading the given dataset
• Importing the required library packages
• Analyzing the general properties
• Finding duplicate and missing values
• Checking unique values and counts
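The duplicate, missing-value and uniqueness checks listed above are one-liners in pandas. The frame here is a small illustrative example; the real project would run the same checks on the loaded defect dataset:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one duplicate row and one missing value
df = pd.DataFrame({"loc": [10, 10, 55, np.nan],
                   "defects": [False, False, True, True]})

print(df.duplicated().sum())   # count of duplicate rows
print(df.isnull().sum())       # missing values per column
print(df.nunique())            # unique value counts per column
```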
Data Gathering
Data Pre-Processing
Choose model
Train model
Test model
Tune model
Prediction
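The seven steps above (gathering through prediction) can be sketched end to end with scikit-learn. The data is generated synthetically here; substituting the real defect dataset covers the first two steps, and the hyperparameter grid shown is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data gathering + pre-processing (synthetic stand-in, then 70:30 split)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

model = DecisionTreeClassifier(random_state=0)              # choose model
grid = GridSearchCV(model, {"max_depth": [2, 4, 8]}, cv=3)  # tune model
grid.fit(X_tr, y_tr)                                        # train model
print(grid.best_params_)                                    # tuned setting
print(grid.score(X_te, y_te))                               # test / predict
```

`GridSearchCV` folds the "tune" and "train" steps together: it cross-validates each candidate setting on the training set and refits the best one before the final test-set evaluation.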
Project Requirements
General:
• Requirements are the basic constraints that must be satisfied to develop a system. They are collected while designing
the system. The following requirements are discussed:
1. Functional requirements
2. Non-Functional requirements
3. Environment requirements
• A. Hardware requirements
• B. software requirements
Non-Functional Requirements:
• Process of functional steps:
• Problem definition
• Preparing data
• Evaluating algorithms
• Improving results
• Predicting the result
Environmental Requirements:
• 1. Software Requirements:
• Operating System : Windows
• Tool : Anaconda with Jupyter Notebook
• 2. Hardware requirements:
• Processor : Pentium IV/III
• Hard disk : minimum 80 GB
• RAM : minimum 2 GB
SOFTWARE DESCRIPTION
Anaconda is a free and open-source distribution of the Python and R programming languages
for scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.) that aims to simplify package management and deployment.
Package versions are managed by the package management system Conda.
ANACONDA NAVIGATOR
Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda®
distribution that allows you to launch applications and easily manage conda packages, environments,
and channels without using command-line commands. Navigator can search for packages on
Anaconda.org or in a local Anaconda Repository.
JUPYTER NOTEBOOK
This website acts as “meta” documentation for the Jupyter ecosystem. It has a collection of
resources to navigate the tools and communities in this ecosystem, and to help you get started.
PYTHON
Python is an interpreted, high-level, general-purpose programming language whose design
emphasizes code readability. Its extensive ecosystem of data-science libraries makes it the
language used for the machine learning work in this project.