Kamlesh Mooc File
On
B.Tech in CSE
By
SESSION (2023-2024)
CERTIFICATE
THIS IS TO CERTIFY THAT KAMLESH SINGH MEHTA HAS SATISFACTORILY PRESENTED A MOOC-BASED
SEMINAR ON THE COURSE TITLED DATA SCIENCE FOUNDATION COURSE IN PARTIAL
FULFILLMENT OF THE SEMINAR PRESENTATION REQUIREMENT IN THE 3RD SEMESTER OF THE B.TECH.
DEGREE COURSE PRESCRIBED BY GRAPHIC ERA HILL UNIVERSITY DURING THE ACADEMIC
SESSION 2023-2024
SIGNATURE
TABLE OF CONTENTS
2. Introduction
3. Introduction to data science
4. Data science life cycle
5. Anomaly detection
6. Association rule mining
7. Introduction to machine learning
8. Languages for data science
9. Conclusion
ACKNOWLEDGEMENT
I take this opportunity to express my profound gratitude and deep regards to my guide, Mr. Ravindra
Koranga, for exemplary guidance, monitoring, and constant encouragement throughout the course.
The help and guidance given from time to time supported me throughout the project. The success
and final outcome of this course required a lot of guidance and assistance from many people, and I am
extremely privileged to have received it all along the way to completing my report. All that I have done is
due to such supervision and assistance, and I will not forget to thank them. I am thankful for and
fortunate to have had constant encouragement, support, and guidance from all the people around me,
which helped me in successfully completing my online course.
INTRODUCTION
This seminar report provides an overview of the Data Science Foundation course offered on the
My Great Learning platform. The course is designed to introduce learners to the fundamental
concepts and techniques of Data Science. The report is structured week-wise, highlighting the
key topics covered in each week of the course. Throughout the course, participants engage in
hands-on programming assignments, quizzes, and projects that allow them to apply the
concepts learned in each week. By the end of the course, learners have a solid understanding of
the foundational concepts and techniques of Data Science and are equipped to apply them to
real-world problems.
Introduction to data science
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract insights and knowledge from structured and unstructured data. It combines
expertise from various domains such as statistics, mathematics, computer science, and domain-
specific knowledge to analyze and interpret complex data sets. The primary goal of data science
is to uncover hidden patterns, trends, and information that can inform decision-making and
drive innovation.
Here is an introduction to key concepts within data science:
Data Collection: Data science begins with the collection of relevant data. This data can come
from various sources, including sensors, databases, social media, and more.
Data can be categorized into structured forms (tabular data with a well-defined schema) and
unstructured forms (text, images, videos).
Data Cleaning and Preprocessing: Raw data is often messy and may contain errors or
missing values. Data scientists perform cleaning and preprocessing to ensure data quality and
consistency.
This step involves handling outliers, dealing with missing values, and transforming data into a
suitable format.
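As a rough illustration of this step, the sketch below fills missing values with the median and clips outliers to a plausible range. The function name `clean_column`, the sample ages, and the 0 to 120 bounds are assumptions made for this example and are not taken from the course material.

```python
from statistics import median


def clean_column(values, lower, upper):
    """Fill missing entries (None) with the median, then clip to [lower, upper]."""
    observed = [v for v in values if v is not None]
    fill = median(observed)                 # median is robust to the outlier below
    cleaned = []
    for v in values:
        if v is None:
            v = fill                        # handle the missing value
        cleaned.append(min(max(v, lower), upper))  # clip out-of-range values
    return cleaned


# Hypothetical raw data: one missing age and one implausible age.
ages = [23, None, 31, 27, 250, 29]
cleaned_ages = clean_column(ages, lower=0, upper=120)
print(cleaned_ages)
```

Median imputation and clipping are only one possible choice; depending on the data, dropping the affected rows or flagging them for review may be more appropriate.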
Machine Learning: Machine learning algorithms are employed to build predictive models and
make data-driven decisions.
Supervised learning involves training a model on labeled data, while unsupervised learning
deals with unlabeled data to discover patterns.
Model Evaluation and Validation: Once a model is trained, it needs to be evaluated to
check how well it performs on unseen data. Techniques such as cross-validation help in assessing a
model's generalizability.
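Cross-validation can be sketched in a few lines. The example below splits the data into k folds, holds out each fold in turn, and averages the error on the held-out part. The "model" here simply predicts the mean of the training targets; that model, the function names, and the sample scores are all assumptions chosen to keep the sketch short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_error(targets, k):
    """Average held-out absolute error of a model that predicts the training mean."""
    fold_errors = []
    for train, test in k_fold_indices(len(targets), k):
        prediction = sum(targets[i] for i in train) / len(train)   # "fit" on train
        error = sum(abs(targets[i] - prediction) for i in test) / len(test)
        fold_errors.append(error)
    return sum(fold_errors) / len(fold_errors)


# Hypothetical target values.
scores = [52, 61, 58, 70, 66, 49, 75, 63, 57, 68]
folds = list(k_fold_indices(len(scores), k=3))
print([len(test) for _, test in folds])
print(round(cross_validated_error(scores, k=3), 2))
```

In practice the data is usually shuffled before splitting, so that the folds do not depend on the order in which the samples were collected.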
Big Data Technologies: With the increasing volume of data, tools and techniques for handling
big data become essential. Distributed computing frameworks like Apache Hadoop and Apache
Spark are commonly used.
Data science life cycle
The data science life cycle is a process that involves several steps to extract insights from data.
Some of the steps of the data science life cycle are discussed below:
Data acquisition: In its instrumentation sense, data acquisition is the process of sampling signals
that measure real-world physical conditions and converting the resulting samples into digital
numeric values that can be manipulated by a computer. This involves several components,
including sensors, signal conditioning circuitry, and analog-to-digital converters. Within data
science, acquisition is one stage of a larger life cycle that also includes business understanding,
data understanding, data preparation, modeling, evaluation, and deployment.
Data preprocessing: Data preprocessing is a crucial step in the data mining process that
involves cleaning, transforming, and integrating raw data before analysis. The goal of data
preprocessing is to improve the quality of the data and to put it in the form required by
the specific data mining task.
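The integrating and transforming parts of preprocessing can be illustrated with a small sketch. It first combines two hypothetical sources (a customer table and an order list) into one value per customer, then rescales the result onto the range 0 to 1 with min-max scaling. The data, the function names, and the choice of min-max scaling are assumptions made for illustration.

```python
def total_spend_per_customer(customers, orders):
    """Integrate two sources: sum the order amounts belonging to each customer."""
    totals = {customer_id: 0.0 for customer_id in customers}
    for customer_id, amount in orders:
        if customer_id in totals:          # ignore orders from unknown customers
            totals[customer_id] += amount
    return totals


def min_max_scale(values):
    """Transform values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                           # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sources: customer 9 appears in the orders but not in the customer table.
customers = {1: "Asha", 2: "Ravi", 3: "Meena"}
orders = [(1, 250.0), (2, 100.0), (1, 50.0), (3, 200.0), (9, 75.0)]

totals = total_spend_per_customer(customers, orders)
scaled = min_max_scale(list(totals.values()))
print(totals)
print(scaled)
```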
Machine learning algorithms: Machine learning algorithms are computational methods that
allow computers to learn patterns from data and make forecasts or decisions without being
explicitly programmed for each task. These algorithms form the foundation of modern artificial
intelligence and are used in a wide range of applications, including image and speech
recognition, natural language processing, recommendation systems, fraud detection,
autonomous cars, and more.
Pattern evaluation: In data mining, pattern evaluation is the process of assessing the quality
of discovered patterns. The quality of patterns can be measured in terms of how accurately
they represent the underlying data, how interesting or useful they are, or how well they can be
used to predict future data.
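One way to measure how well a discovered pattern predicts new data is to compare its predictions with the actual outcomes. The sketch below computes accuracy, precision, and recall for binary labels. These three measures are a common choice rather than something the course prescribes, and the labels are made up for the example.

```python
def evaluate(actual, predicted):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical outcomes and the predictions made by some pattern or model.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = evaluate(actual, predicted)
print(accuracy, precision, recall)
```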
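Anomaly detection
Anomaly detection identifies data points that deviate markedly from the rest of the data.
There are several techniques used for anomaly detection in data mining, such as: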
Clustering: Clustering is a technique that groups similar data points together based on their
characteristics. Anomalies are identified as data points that do not belong to any cluster.
Regression: Regression is a technique that predicts a continuous value for a new observation
based on its characteristics. Anomalies are identified as observations that have a large residual
error.
Density-Based Methods: Density-based methods identify anomalies as data points that have a
low probability of being generated by the underlying data distribution.
Distance-Based Methods: Distance-based methods identify anomalies as data points that are far
away from the rest of the data points in the dataset.
Statistical Methods: Statistical methods identify anomalies as data points that have a low
probability of occurring based on the statistical properties of the dataset.
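Two of the approaches listed above can be sketched briefly. The first function is a statistical method that flags values whose z-score exceeds a threshold; the second is a distance-based method that flags values whose k-th nearest neighbour is far away. The function names, the sample readings, and the thresholds (2.0 and 5) are assumptions for this example; suitable thresholds depend on the data.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold):
    """Statistical method: flag values more than `threshold` std. deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                      # all values identical: nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def knn_distance_outliers(values, k, threshold):
    """Distance-based method: flag values whose k-th nearest neighbour is farther than `threshold`."""
    outliers = []
    for i, v in enumerate(values):
        distances = sorted(abs(v - other) for j, other in enumerate(values) if j != i)
        if distances[k - 1] > threshold:
            outliers.append(v)
    return outliers


# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 11, 10, 50]
print(zscore_outliers(readings, threshold=2.0))
print(knn_distance_outliers(readings, k=2, threshold=5))
```

On small samples a single extreme value inflates the standard deviation, so the z-score method can miss anomalies that the distance-based method still catches.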
Association rule mining
Association rule mining is a data mining technique that is used to discover interesting
relationships between variables in large datasets. It is a type of unsupervised learning that
identifies patterns or associations between items in a dataset.
The goal of association rule mining is to identify the relationships between variables that occur
frequently in the dataset. The relationships are represented as rules of the form “if X then Y”,
where X and Y are sets of items. There are several algorithms used for association rule mining,
such as:
Apriori: This algorithm is used to find frequent itemsets in a dataset. It works by generating
candidate itemsets and pruning those that do not meet the minimum support threshold.
Eclat: This algorithm is similar to Apriori but uses a depth-first search strategy to find frequent
itemsets.
FP-Growth: This algorithm is used to find frequent itemsets in a dataset. It works by building a
tree-like structure called a frequent pattern tree (FP-tree) and mining the tree to find frequent
itemsets.
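The Apriori idea described above (generate candidates, then prune those below the minimum support) can be sketched as follows. This is a simplified illustration rather than an optimized implementation; the transactions and the 0.6 support threshold are assumptions for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting the minimum support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only candidates that meet the minimum support threshold.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]
frequent_itemsets = apriori(baskets, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```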
Association rule mining has several applications, such as market basket analysis, web usage
mining, and medical diagnosis. For example, in market basket analysis, association rule mining
can be used to identify which products are frequently purchased together.
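For a market basket, a rule "if X then Y" is usually judged by its support, confidence, and lift. Confidence and lift are standard measures that this report does not define, so the definitions in the comments below are stated as assumptions of the sketch, as are the baskets and the rule being tested.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    count_x = sum(1 for t in transactions if antecedent <= t)
    count_y = sum(1 for t in transactions if consequent <= t)
    count_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_xy / n                  # share of baskets containing X and Y
    confidence = count_xy / count_x         # share of X-baskets that also contain Y
    lift = confidence / (count_y / n)       # > 1 means X and Y co-occur more than by chance
    return support, confidence, lift


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

The function assumes the antecedent and consequent each occur in at least one basket; otherwise the divisions would fail.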
Introduction to machine learning
Machine learning is a subfield of artificial intelligence that involves the development of
algorithms and statistical models that enable computers to automatically learn from data and
make predictions. It is a powerful tool that has been used in various domains such as image
recognition, natural language processing, and speech recognition. There are four major
categories of machine learning: supervised, unsupervised, reinforcement, and semi-supervised.
Supervised learning is a type of machine learning where the algorithm is trained on labeled
data, which means that the input data has a corresponding output label. Unsupervised learning,
on the other hand, is a type of machine learning where the algorithm is trained on unlabeled
data, which means that the input data does not have any corresponding output label.
Reinforcement learning is a type of machine learning where the algorithm learns by interacting
with an environment and receiving feedback in the form of rewards or penalties. Semi-
supervised learning is a type of machine learning where the algorithm is trained on a
combination of labeled and unlabeled data.
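The difference between supervised and unsupervised learning can be seen in a small sketch. The first function uses labeled data to classify a new point by its nearest labeled neighbour; the second receives no labels and groups the same points into two clusters with k-means (k = 2). The algorithms, function names, and height data are assumptions chosen for illustration.

```python
def nearest_neighbour_predict(train_points, train_labels, query):
    """Supervised: predict the label of the closest labeled training point."""
    distances = [abs(p - query) for p in train_points]
    return train_labels[distances.index(min(distances))]


def two_means(points, iterations=10):
    """Unsupervised: split 1-D points into two clusters using k-means with k = 2."""
    centres = [min(points), max(points)]            # simple deterministic start
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            nearest = 0 if abs(p - centres[0]) <= abs(p - centres[1]) else 1
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters


# Hypothetical heights in centimetres.
heights = [150, 152, 155, 180, 183, 185]
labels = ["short", "short", "short", "tall", "tall", "tall"]

print(nearest_neighbour_predict(heights, labels, query=158))   # uses the labels
print(two_means(heights))                                      # ignores the labels
```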
Languages for data science
Data science is a rapidly growing field that involves the use of statistical and computational
methods to extract insights from data. There are several programming languages that are
commonly used in data science, including Python, R, SQL, Java, and many more.
This report covers only Python and R.
Python
Python is a high-level, interpreted, general-purpose, dynamic programming language that
focuses on code readability. It was created by Guido van Rossum and first released in 1991.
Python is widely used in various domains such as web development, scientific computing, data
analysis, artificial intelligence, and more.
Python is known for its simplicity, readability, and ease of use. It has a large and active
community that contributes to its development and maintenance. Python has a vast collection
of libraries and frameworks that make it easy to perform complex tasks with minimal coding.
Some of the popular libraries and frameworks in Python include NumPy, Pandas, Matplotlib,
Django, Flask, and TensorFlow.
Python is an interpreted language: the interpreter compiles source code to bytecode and runs it
directly, with no separate ahead-of-time build step. This makes it easy to test and debug code
interactively. Python is also a dynamically typed language, which means that the data type of a
variable is determined at runtime. This makes it quick to write code without declaring data types,
although type errors then surface only when the code runs.
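Dynamic typing can be seen in a short sketch: the same name can refer to values of different types, the same function works on any type that supports the operation it uses, and a type mismatch is reported only when the offending line runs. The variable and function names are made up for this example.

```python
# The same name can be rebound to values of different types at runtime.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__
print(first_type, second_type)


def double(x):
    """Works for any type that supports multiplication by an integer."""
    return x * 2


print(double(21))      # integer multiplication
print(double("ab"))    # string repetition

# A type mismatch is detected only when the line executes.
try:
    result = "total: " + 5
except TypeError as error:
    print("TypeError:", error)
```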
R
R is a programming language that is widely used for statistical computing and graphics. It was
created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the
early 1990s. R is an open-source language that is available on various platforms, including
Windows, Linux, and macOS.
R is known for its powerful data visualization capabilities, which allow users to create high-
quality graphs and charts. It also has a wide range of statistical and machine learning algorithms
that can be used for data analysis and modeling. R is a dynamically typed language, which
means that the data type of a variable is determined at runtime. This makes it easy to write
code quickly without worrying about data types.
Conclusion
Over the course of the past five weeks, we have explored various concepts in the
field of data science. We have covered a wide range of topics, including the data science life cycle,
data mining, machine learning, and more. Each week has provided us with valuable insights and
practical knowledge that can be applied to real-world problems.
Certificate