PCE20CS705 - Rameshber Goswami - ITSREPORT
(Session 2022-23)
I hereby declare that the work which is being presented in the Industrial Training report titled
Machine Learning & Data Science in partial fulfillment for the award of the Degree of
Bachelor of Technology in Computer Engineering and submitted to the Department of
Computer Engineering, Poornima College of Engineering, Jaipur, is an authentic record of
my work carried out at Poornima Institute of Engineering & Technology, Jaipur (Rajasthan)
during the session 2022-23.
I have not submitted the matter presented in this report anywhere for the award of any other
Degree.
Training Certificate from Company
DEPARTMENT OF COMPUTER ENGINEERING
Date: 10/10/2022
CERTIFICATE
This is to certify that the Industrial Training report Machine Learning & Data
Engineering during the session 2022-23. The industrial training work is found
ACKNOWLEDGEMENT
I would like to convey my profound sense of reverence and admiration to my supervisors, Mr.
Deepak Moud, HOD (Computer Engineering), and Dr. Uday Pratap Singh, Assistant Professor,
Poornima Institute of Engineering & Technology, Jaipur (Rajasthan), for their intense concern,
attention, priceless direction, guidance, and encouragement throughout this internship.
I extend my heartiest gratitude to Dr. Nikita Jain, Coordinator-Industrial & Head, Department
of Computer Engineering, Poornima College of Engineering, for unvarying support, guidance,
and motivation during the course of this research.
I am grateful to Dr. Mahesh Bundele, Director, Poornima College of Engineering, for his
helping attitude and keen interest in completing this training work in time.
I would like to express my deep sense of gratitude towards the management of Poornima College
of Engineering including Dr. S. M. Seth, Chairman Emeritus, Poornima Group, and former
Director NIH, Roorkee, Shri Shashikant Singhi, Chairman, Poornima Group, Mr. M. K. M.
Shah, Director Admin & Finance, Poornima Group, and Ar. Rahul Singhi, Director of
Poornima Group for the establishment of the institute and providing facilities for my studies.
I am deeply thankful to my parents and all other family members for their blessings and
inspiration. Last, but not least I would like to give special thanks to God who enabled me to
complete my training work on time.
Rameshber Goswami
PCE20CS705
TABLE OF CONTENTS
Title Page
Candidate’s Declaration
Training Certificate
Acknowledgment
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 About company
1.2 Training Platform
1.3 Training Starting Date
1.4 Training Ending Date
1.5 Total Training Duration
1.6 Date of Certification
1.7 Training Pictures/Images
1.8 Conclusion
Chapter 2: Technical Training Platform
2.1 Introduction
2.2 Reason for selecting this platform
2.3 Profile of Organization
2.4 Conclusion
LIST OF TABLES
LIST OF FIGURES
1. Training Photo 1
2. Training Photo 2
3. Training Photo 3
4. Training Photo 4
5. Training Photo 5
6. NumPy Example
7. Data Visualization
8. Type Of Plots
9. Matplotlib Example
12. Project
ABSTRACT
The field of Machine Learning is concerned with the question of how to construct
computer programs that automatically improve with experience. The answer lies in the
data. Machine Learning is considered a subset of AI that uses statistical methods to
enable machines to improve with experience. It enables a computer system to make
decisions in order to carry out a certain task. These programs or algorithms are designed
so that they can learn and improve over time by observing new data. Machine
Learning aims to derive meaning from data. Thus, data is the key to unlocking Machine
Learning. The more high-quality data an ML algorithm has, the more accurate it becomes.
Data science is the study of data to establish its origin, its content, and how it can be
of benefit. It is about extracting meaning from large and complex
amounts of data. The data may be either structured or unstructured, and the goal is to obtain
valuable insights about business or market patterns to help inform business decisions.
Data scientists are specialists who work to convert raw data into meaningful business
insights. They are usually trained and skilled in algorithmic coding, data mining, machine
learning, and statistics. Data science also incorporates other fields like mathematics,
statistics, and computation to understand and present data. The two fields are related in
the way that squares are related to rectangles: every square is a rectangle, but not every
rectangle is a square. Data science is the rectangle, while machine learning is the square,
and each requires its own skill set. Data science involves researching, building, and
interpreting a model, while machine learning involves producing that model. Data science uses a
scientific approach to obtain meaning from data, while machine learning deals with
system programming to automate and improve learning from data. Machine learning
depends on data science, since the data needs to be prepared before a model can be created,
trained, and tested.
CHAPTER 1
INTRODUCTION
1.1 About Company
I undertook an in-house Summer Internship offered by the Poornima Institute of
Engineering & Technology, Sitapura 302022, Jaipur (Rajasthan). Poornima Institute of
Engineering & Technology (PIET) is a constituent college of the Poornima Group of
Colleges. The institute was established in 2007 in Jaipur, Rajasthan. PIET offers a
4-year B. Tech program in 4 disciplines
with an annual intake of 420 students. The institute is affiliated with Rajasthan Technical
University (RTU).
The institute collaborates with IBM Lab for research on Business Intelligence and Cloud
Computing.
The institute has an MTLC (Mission 10X Technology Learning Center) by Wipro.
The institute organizes several workshops on technical and non-technical topics. It
has tie-ups with industry and academia.
The institute has a collaboration with Wipro for Wipro Mission10X.
The institute has two Centres of Excellence recognized by Rajasthan Technical University:
Integrated Design and Innovations in Advanced Digital Manufacturing, and AI & Big Data.
I did my Summer Internship in Machine Learning & Data Science at Poornima Institute
of Engineering & Technology. The institute offered an in-house internship opportunity for
students of Poornima Group. As a student of Poornima College of Engineering, which is also
a part of Poornima Group, I took up that opportunity. The training was conducted offline at PIET,
in the Neural Network & Deep Learning Laboratory, for 45 days.
The Summer Internship in Machine Learning & Data Science was an internship program offered
by PIET, and the ending date of the training was 8th August 2022.
The Summer Internship in Machine Learning & Data Science was an internship program offered
by PIET, which gave us exposure to the industry and how it works. The program included
sessions and industrial visits that showed us real industry scenarios.
The total duration of the training was 45 days.
The industrial training lasted 45 days, and after its completion we were assigned major
projects to submit in order to receive our certificates. An exhibition of the projects was held on
18th August 2022. After submitting my major project, I received my certificate on that
day.
1.7 Training Photos
1.8 Conclusion
The technology covered in the Summer Internship was Machine Learning & Data Science.
The institute and our training coordinators worked hard to train us. During the internship, I
learned, starting from the basics, Python, NumPy, Pandas, and Machine Learning
techniques. I can now perform data cleaning, data scraping, and data manipulation, and present
conclusions in a format that clients can understand.
CHAPTER 2
TECHNICAL TRAINING PLATFORM
2.1 Introduction
I did my Summer Internship in Machine Learning & Data Science at Poornima Institute of
Engineering & Technology. The institute offered an in-house internship opportunity for students
of Poornima Group. As a student of Poornima College of Engineering, which is also a part of
Poornima Group, I took up that opportunity. The training was conducted offline at PIET,
in the Neural Network & Deep Learning Laboratory, for 45 days.
The technology covered in the Summer Internship was Machine Learning & Data
Science. The institute and our training coordinators worked hard to train us. During the
internship, I learned, starting from the basics, Python, NumPy, Pandas, and
Machine Learning techniques. I can now perform data cleaning, data scraping, and data
manipulation, and present conclusions in a format that clients can understand.
CHAPTER 3
OVERVIEW OF TECHNOLOGY LEARNED
Python 2.0 was released in 2000, and the 2.x versions were the prevalent releases until December
2008. At that time, the development team made the decision to release version 3.0, which
contained a few relatively small but significant changes that were not backward compatible with
the 2.x versions. Python 2 and 3 are very similar, and some features of Python 3 have been
backported to Python 2. But in general, they remain not quite compatible.
The Python libraries that are mainly used for machine learning and data science are as follows:
1. NumPy
2. Matplotlib
3. Pandas
1. NumPy
NumPy stands for Numerical Python. It is a Python package for the computation and
processing of single-dimensional and multidimensional arrays. Travis
Oliphant created the NumPy package in 2005 by incorporating the features of the Numarray
module into its ancestor module, Numeric. It is an extension module for Python that is mostly
written in C.
With the revolution in data science, data analysis libraries like NumPy, SciPy, Pandas, etc. have
seen a lot of growth. With a much easier syntax than many other programming languages, Python is a
first-choice language for data scientists.
NumPy provides a convenient and efficient way to handle vast amounts of data. NumPy is
also very convenient for matrix multiplication and data reshaping. NumPy is fast, which makes
it practical to work with large sets of data.
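The reshaping and matrix multiplication mentioned above can be illustrated with a minimal sketch. The array values and variable names here are made up for illustration and are not taken from the training material.

```python
import numpy as np

# Build a one-dimensional array and reshape it into a 2 x 3 matrix.
values = np.arange(6)            # [0 1 2 3 4 5]
matrix = values.reshape(2, 3)    # [[0 1 2], [3 4 5]]

# Matrix multiplication: (2 x 3) @ (3 x 2) gives a 2 x 2 result.
product = matrix @ matrix.T

print(matrix.shape)   # (2, 3)
print(product)        # [[ 5 14], [14 50]]
```

The `@` operator performs matrix multiplication on NumPy arrays, and `reshape` changes the shape of the array without changing its elements.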
There are the following advantages of using NumPy for data analysis.
Nowadays, NumPy in combination with SciPy and Matplotlib is used as a replacement for
MATLAB, as Python is a more complete and easier programming language than MATLAB.
Fig.no-6: NumPy
2. Matplotlib
The human mind adapts more readily to a visual representation of data than to textual data. We
understand things more easily when they are visualized. It is better to represent data through
a graph, where we can analyze it more efficiently and make specific decisions
based on the analysis. Before learning Matplotlib, we need to understand data
visualization and why it is important.
Data Visualization
There are five key plots that are used for data visualization.
Need of Matplotlib
Example:
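import matplotlib.pyplot as plt
# x and y coordinates of the points to plot
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
# Draw a line through the points in the order given, then display the figure
plt.plot(x, y)
plt.show()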
Output:
Fig.no-9: Matplotlib
3. Pandas
Data analysis requires a lot of processing, such as restructuring, cleaning, or merging.
Different tools are available for fast data processing, such as NumPy, SciPy, Cython,
and Pandas. But we prefer Pandas because working with Pandas is fast, simple, and more
expressive than other tools.
Benefits of Pandas:
o Data Representation: It represents the data in a form that is suited for data analysis
through its DataFrame and Series structures.
o Clear code: The clear API of Pandas allows you to focus on the core part of the code.
So, it provides clear and concise code for the user.
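The cleaning and merging steps mentioned above can be sketched as follows. This is a minimal illustration: the two tables, their column names, and the choice to fill a missing value with the column mean are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical records with one missing value and one duplicated row.
marks = pd.DataFrame({
    "student": ["A", "B", "C", "C"],
    "marks": [72.0, None, 88.0, 88.0],
})
branches = pd.DataFrame({
    "student": ["A", "B", "C"],
    "branch": ["CS", "IT", "CS"],
})

# Cleaning: drop the duplicated row, then fill the missing mark with the mean.
clean = marks.drop_duplicates().copy()
clean["marks"] = clean["marks"].fillna(clean["marks"].mean())

# Merging: join the two tables on the shared "student" column.
merged = clean.merge(branches, on="student")
print(merged)
```

After cleaning, three rows remain and student B receives the mean of the other two marks (80.0). Whether mean-filling is appropriate depends on the data at hand.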
Pandas provides two data structures for processing data, i.e., Series and DataFrame,
which are discussed below:
1. Series is defined as a one-dimensional array that is capable of storing various data types.
The row labels of a Series are called the index. We can easily convert a list, tuple, or
dictionary into a Series using the Series() constructor. A Series cannot contain multiple columns. It
has one parameter:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
2. DataFrame is a widely used data structure of pandas and works with a two-dimensional
array with labelled axes (rows and columns). DataFrame is defined as a standard way to
store data and has two different indexes, i.e., row index and column index.
Example:
import pandas as pd
x = ['Python', 'Pandas']
df = pd.DataFrame(x)
print(df)
Output
0
0 Python
1 Pandas
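To make the two indexes of a DataFrame more concrete, the following sketch builds a small table with explicit row and column labels. The labels "r1", "r2", "library", and "purpose" are invented for this example.

```python
import pandas as pd

# Illustrative data: the row labels and column labels are made up.
df = pd.DataFrame(
    {"library": ["NumPy", "Pandas"], "purpose": ["arrays", "tables"]},
    index=["r1", "r2"],
)

print(df.index.tolist())        # row index: ['r1', 'r2']
print(df.columns.tolist())      # column index: ['library', 'purpose']
print(df.loc["r2", "library"])  # select by row label and column label: Pandas
```

Here `df.loc` looks up a value by its row label and column label, which is what the row index and column index make possible.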
Data Science
Data science is the in-depth study of massive amounts of data. It involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can
find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient
algorithms to solve data-related problems. It is the future of artificial intelligence.
In short, we can say that data science is all about:
o Understanding the data to make better decisions and finding the final result.
Some years ago, there was less data, and it was mostly available in a structured form that could
easily be stored in Excel sheets and processed using BI tools.
But in today's world, data is becoming vast: approximately 2.5 quintillion bytes of data are
generated every day, which has led to a data explosion. Researchers estimated that by
2020, 1.7 MB of data would be created every second for every person on earth. Every
company requires data to work, grow, and improve its business.
Handling such a huge amount of data is a challenging task for every organization. To
handle, process, and analyze it, we require complex, powerful, and efficient
algorithms and technology, and that technology came into existence as data science.
Following are some main reasons for using data science technology:
o Data science is working for automating transportation such as creating a self-driving car,
which is the future of transportation.
o Data science can help with different kinds of predictions, such as surveys, elections, flight
ticket confirmation, etc.
Fig.no-10: Data Science
Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a
set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train
them, and on the basis of training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is mainly divided into four types,
which are:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
Classification
Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset.
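As a small illustration of predicting a categorical output, the sketch below uses a nearest-centroid rule written with NumPy. This rule is chosen only for brevity and is not necessarily one of the algorithms covered in the training; the data points and the "Yes"/"No" labels are made up.

```python
import numpy as np

# Toy training data (made up): two features per sample, labels "No"/"Yes".
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["No", "No", "Yes", "Yes"])

# "Training": compute the mean point (centroid) of each category.
classes = np.unique(y_train)
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

def predict(samples):
    """Assign each sample the category of its nearest centroid."""
    samples = np.asarray(samples, dtype=float)
    # Distance from every sample to every centroid: shape (n_samples, n_classes).
    dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

predictions = predict([[1.2, 1.4], [5.5, 5.0]])
print(predictions)  # ['No' 'Yes']
```

The output is always one of the categories seen during training, which is what distinguishes classification from regression.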
Regression
Regression algorithms are used to solve regression problems, in which there is a relationship
between the input variables and a continuous output variable; linear regression assumes that this
relationship is linear. They are used to predict continuous quantities, such as market trends or
the weather.
o Lasso Regression
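A minimal sketch of regression is an ordinary least-squares fit of a straight line with NumPy. The data below is made up so that it follows y = 2x + 1 exactly; real data would contain noise, and the fitted values would then only approximate the underlying relationship.

```python
import numpy as np

# Made-up data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of a straight line y = slope * x + intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict a continuous output for an unseen input.
prediction = slope * 5.0 + intercept
print(round(slope, 3), round(intercept, 3), round(prediction, 3))
```

Unlike classification, the prediction here is a number on a continuous scale (about 11 for x = 5).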
Unsupervised learning is different from the supervised learning technique; as its name suggests,
there is no need for supervision. In unsupervised machine learning, the machine is
trained using an unlabeled dataset and predicts the output without any supervision.
The models are trained with data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to find the
hidden patterns in the input dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a
way to group the objects into a cluster such that the objects with the most similarities remain in
one group and have fewer or no similarities with the objects of other groups. An example of the
clustering algorithm is grouping the customers by their purchasing behavior.
o Mean-shift algorithm
o DBSCAN Algorithm
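The idea of grouping customers by purchasing behavior can be sketched with a small k-means loop in NumPy. K-means is used here only because it is short to write out; the customer figures, the choice of two clusters, and the fixed starting centroids are assumptions for the example.

```python
import numpy as np

# Made-up customer data: (visits per month, average spend).
points = np.array([[1.0, 10.0], [2.0, 12.0], [1.5, 11.0],
                   [8.0, 60.0], [9.0, 65.0], [8.5, 62.0]])

# Start from two of the points as initial centroids (fixed for repeatability).
centroids = points[[0, 3]].copy()

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its members.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # [0 0 0 1 1 1]
```

No labels are supplied at any point; the two groups emerge from the similarity between the points alone.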
Association
Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it
can generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web
usage mining, continuous production, etc.
Some popular algorithms for association rule learning are the Apriori algorithm, Eclat, and the
FP-growth algorithm.
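The dependency of one item on another is usually measured with support and confidence. The sketch below computes both for a handful of made-up shopping baskets; it illustrates the measures only and is not an implementation of Apriori, Eclat, or FP-growth.

```python
# Made-up shopping baskets for a market-basket illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # 2 of the 3 bread baskets also hold milk
```

A rule such as bread -> milk is considered interesting when both its support and its confidence are high enough for the application.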
3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning algorithm that lies between supervised
and unsupervised machine learning. It represents the intermediate ground between supervised
(with labelled training data) and unsupervised (with no labelled training data) learning
algorithms and uses a combination of labelled and unlabeled datasets during the training
period.
Although semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on data that includes a few labels, the data mostly consists of unlabeled
examples. Labels are costly to obtain, so in practice only a few may be available. It
differs from supervised and unsupervised learning, which are defined by the presence and absence
of labels respectively.
To overcome the drawbacks of supervised and unsupervised learning algorithms, the
concept of semi-supervised learning was introduced. The main aim of semi-supervised learning is
to use all the available data effectively, rather than only the labelled data as in supervised learning.
Initially, similar data is clustered using an unsupervised learning algorithm, and the clusters then
help to assign labels to the unlabeled data. This matters because labelled data is comparatively
more expensive to acquire than unlabeled data.
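The step of turning unlabeled data into labelled data can be sketched very simply: each unlabeled point borrows the label of the labelled point it is most similar to. This nearest-neighbour rule is a simplification chosen for illustration, and the values and the labels "low" and "high" are made up.

```python
# Made-up 1-D data: only two points carry a label; None marks "unlabelled".
X = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
y = ["low", None, None, "high", None, None]

# Positions of the few points that already have a label.
labelled = [i for i, label in enumerate(y) if label is not None]

# Give every unlabelled point the label of its nearest labelled point.
pseudo = list(y)
for i, label in enumerate(y):
    if label is None:
        nearest = min(labelled, key=lambda j: abs(X[i] - X[j]))
        pseudo[i] = y[nearest]

print(pseudo)  # ['low', 'low', 'low', 'high', 'high', 'high']
```

The resulting pseudo-labels could then be used together with the original labels to train a supervised model on the full dataset.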
We can picture these approaches with an example. Supervised learning is where a student is
under the supervision of an instructor at home and at college. If that student
analyzes the same concept independently, without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student revises the concept
independently after first studying it under the guidance of an instructor at college.
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning, and agents learn
from their experiences only.
Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.
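Learning from experience alone can be illustrated with tabular Q-learning, one specific reinforcement learning algorithm that the report itself does not name. In this sketch an agent walks along a made-up corridor of five cells and is rewarded only on reaching the last one; the environment, the reward, and the constants ALPHA and GAMMA are all assumptions for the example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_STATES = 5          # states 0..4 in a row; state 4 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Move one cell; reaching the goal pays reward 1, everything else 0."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

# Q-table: one row per state, one column per action.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):                      # episodes
    state = 0
    while state != N_STATES - 1:
        action = random.choice(ACTIONS)   # explore at random
        nxt, reward = step(state, action)
        # Q-learning update from the experienced transition.
        target = reward + GAMMA * max(Q[nxt])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt

# Greedy policy learned from experience: best action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1], i.e. always move right
```

No labelled examples of "correct" moves are given; the agent discovers that moving right is best purely from the rewards it experiences.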
CHAPTER 4
PROJECT DESCRIPTION
Description
The project Predict Dropout and Academic Success aims to contribute to the reduction of
academic dropout and failure in higher education, by using machine learning techniques to
identify students at risk at an early stage of their academic path, so that strategies to support them
can be put into place. The dataset includes information known at the time of student enrollment –
academic path, demographics, and social-economic factors. The problem is formulated as a
three-category classification task (dropout, enrolled, and graduate) at the end of the normal
duration of the course.
The data is used to build classification models for predicting the student’s academic success and
dropout. This problem is formulated as a three-category classification task, in which there is a
strong imbalance towards one of the classes.
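One common way to account for class imbalance is to weight each class inversely to its frequency. The report does not state how the imbalance was handled in the project, so the sketch below is only an illustration; the label counts are made up and do not reflect the real dataset.

```python
from collections import Counter

# Made-up label counts; the real dataset's class distribution is not given here.
labels = ["graduate"] * 6 + ["dropout"] * 3 + ["enrolled"] * 1

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# "Balanced" weighting: weight = n_samples / (n_classes * class_count),
# so rarer classes receive proportionally larger weights.
weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(weights)
```

With these weights, an error on the rare "enrolled" class counts for more during training than an error on the common "graduate" class.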
Predict Dropout or Academic Success is a machine learning model that predicts whether a student
will drop out or achieve academic success based on the given variables, i.e., curricular units in
the 1st and 2nd semesters, age, and gender.
Project Snapshots:
Fig.no-12: Project
CONCLUSION
Machine Learning can be supervised or unsupervised. If you have a smaller amount of
clearly labelled data for training, opt for Supervised Learning. Unsupervised Learning would
generally give better performance and results for large data sets. If you have a huge data set
easily available, go for deep learning techniques. You have also learned about Reinforcement Learning
and Deep Reinforcement Learning. You now know what Neural Networks are, along with their applications
and limitations.
Finally, when it comes to developing machine learning models of your own, you looked
at the choice of various development languages, IDEs, and platforms. The next thing that you need
to do is start learning and practicing each machine learning technique. The subject is vast,
meaning that it has breadth, but in terms of depth, each topic can be learned in a few hours.
The topics are independent of one another. You need to take up one topic at a time,
learn it, practice it, and implement its algorithms in a language of your choice. This is
the best way to start studying Machine Learning. Practicing one topic at a time, you
would very soon acquire the breadth that is eventually required of a Machine Learning expert.
Machine learning is a powerful tool that can help you gain valuable insight from your data.
However, it’s important to remember that it is still an art to master. It’s imperative that you have
a good understanding of how to organize and use data. The field of data science is a complex one
that spans a variety of domains, and machine learning is one of the most exciting. This
technology can help your business solve problems and make better decisions by using data to
predict the future. Using machine learning algorithms can help you prevent financial fraud.
These algorithms analyze billions of online transactions and recognize patterns in them, enabling
them to generate insights about new data.
Because machine learning involves coding lessons from examples of good data, it’s a versatile
and powerful tool. The applications of these techniques are limitless, and there’s no shortage of
opportunities for a data scientist with expertise in these techniques.