Full Data Science
(IT)
SEMESTER - I (CBCS)
DATA SCIENCE
SUBJECT CODE : PSIT102
© UNIVERSITY OF MUMBAI
Prof. Suhas Pednekar
Vice Chancellor
University of Mumbai, Mumbai.
Prof. Ravindra D. Kulkarni
Pro Vice-Chancellor,
University of Mumbai.
Prof. Prakash Mahanwar
Director,
IDOL, University of Mumbai.
Unit I
1. Data Science Technology Stack
2. Vermeulen-Krennwallner-Hillman-Clark
Unit II
3. Three Management Layers
4. Retrieve Super Step
Unit III
5. Assess Superstep
6. Assess Superstep
Unit IV
7. Process Superstep
8. Transform Superstep
Unit V
9. Transform Superstep
10. Organize And Report Supersteps
*****
Syllabus
M. Sc (Information Technology) Semester – I
Course Name: Data Science Course Code: PSIT102
Periods per week (1 Period is 60 minutes): Lectures - 4
Credits: 4
Evaluation System            Hours    Marks
Theory Examination           2½       60
Internal                     -        25
Practical No.    Details
1 - 10           Practicals based on the above syllabus, covering the entire syllabus
1
DATA SCIENCE
TECHNOLOGY STACK
Unit Structure
1.1 Introduction
1.2 Summary
1.3 Unit End Questions
1.4 References
● Functional Layer:
The functional layer is the core processing capability of the data science factory. Its core data processing methodology is the R-A-P-T-O-R framework (Retrieve, Assess, Process, Transform, Organize, Report).
● Organize Super Step:
The Organize superstep sub-divides the data warehouse into data marts.
Business Layer:
● A traditional database management system works only with a schema: data can be loaded only after the schema has been described, and that schema provides the single point of view for describing and viewing the data in the database.
● It stores dense data: all the data is kept in the datastore, and schema-on-write is the widely used methodology for storing such dense data.
● Schema-on-write schemas are built for a specific purpose, which makes it hard to change them or the data they maintain later.
● When a large amount of raw data is available for processing, some of it is lost during loading because it does not fit the schema, and that weakens future analysis.
● If important data is never stored in the database, you cannot process it in any later data analysis.
Data Lake:
● A data lake is a storage repository for large amounts of raw data, meaning structured, semi-structured, and unstructured data.
● It is the place where you can store these three types of data with no fixed limit on the amount or structure of the data stored.
● If we compare schema-on-write with a data lake, schema-on-write stores data in a data warehouse against a predefined schema, whereas a data lake imposes far less structure on the data it stores.
● A data lake can store data with so little structure because it follows a schema-on-read architecture: structure is applied only when the data is read.
● A data lake allows us to transform the raw data (structured, semi-structured, or unstructured) into a structured format so that SQL queries can be run for analysis.
● Most of the time a data lake is deployed using a distributed data object storage database that enables schema-on-read, so that business analytics and data mining tools and algorithms can be applied to the data.
● Retrieval of data is fast because no schema is applied on write, and the data must remain accessible without failures or complex access paths.
● A data lake is similar to a real river or lake: water comes in from many different places, and all the small rivers and streams merge into one large body from which anyone can draw water whenever it is needed.
● It is a low-cost and effective way to store large amounts of data in a centralized store for further organizational analysis and deployment. A short schema-on-read sketch in Python follows.
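To make the schema-on-read idea concrete, here is a minimal Python sketch using pandas. The file name events_raw.csv and its columns event_id and event_time are made up for the illustration; the point is only that the raw text is stored as-is and the schema is applied when the data is read.

import pandas as pd

# Raw data lands in the lake untyped; nothing is enforced on write.
raw = pd.read_csv("events_raw.csv", dtype=str)

# The schema is applied only at read time, just before analysis.
events = raw.astype({"event_id": "int64"})
events["event_time"] = pd.to_datetime(events["event_time"], errors="coerce")
print(events.dtypes)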
Figure 1.1
Data Vault:
● Data vault is a database modeling method designed for long-term historical storage of large amounts of data, kept under control within the data vault.
● Data arrives from different sources, and the model is designed so that the data can be loaded in parallel, allowing very large volumes to be loaded without failures or major redesign.
● The data vault is the stage at which the schema-on-read data lake is transformed into a schema-on-write structure.
● Data vaults are designed to answer schema-on-read query requests against the data lake quickly, because schema-on-read increases the speed at which new data can be generated for analysis and implementation.
● A data vault stores a single version of the data and does not distinguish between good data and bad data.
● The data vault is built from three main components or structures: hubs, links, and satellites (a small sketch of the three structures follows their descriptions below).
Hub:
● A hub contains a set of unique business keys whose data rarely changes, together with metadata recording the source from which each key originates.
● A hub stores a surrogate key for each metadata entry and hub item, i.e., the origin of the business key.
● The business keys held in a hub never change over time.
● There are different types of hubs, such as the person hub, time hub, object hub, event hub, and location hub.
● The time hub contains IDTimeNumber, ZoneBaseKey, DateTimeKey, and DateTimeValue, and it is connected through links such as Time-Person, Time-Object, Time-Event, and Time-Location.
● The person hub contains IDPersonNumber, FirstName, SecondName, LastName, Gender, TimeZone, BirthDateKey, and BirthDate, and it is connected through links such as Person-Time, Person-Object, Person-Location, and Person-Event.
● The object hub contains IDObjectNumber, ObjectBaseKey, ObjectNumber, and ObjectValue, and it is connected through links such as Object-Time, Object-Link, Object-Event, Object-Location, and Object-Person.
● The event hub contains IDEventNumber, EventType, and EventDescription, and it is connected through links such as Event-Person, Event-Location, Event-Object, and Event-Time.
● The location hub contains IDLocationNumber, ObjectBaseKey, LocationNumber, LocationName, Longitude, and Latitude, and it is connected through links such as Location-Person, Location-Time, Location-Object, and Location-Event.
Link:
● Links play a very important role in recording transactions and associations between business keys. Tables relate to each other through relationships such as one-to-one, one-to-many, many-to-one, and many-to-many.
● A link represents and connects only the elements of a business relationship, so that when one hub relates to another through a link, data transfers smoothly between them.
Satellites:
● The hubs and links form the structure of the model, but they store no chronological detail of the data themselves; on their own they cannot tell you, for example, the mean, median, mode, maximum, minimum, or sum of the data.
● Satellites are the structures that store the detailed information about the related data and business keys over time, and they hold the largest volume of data in the data vault.
● Together, the three components (hubs, links, and satellites) let data analysts, data scientists, and data engineers store the business structure and all types of information in the data vault. A small pandas sketch of the three structures follows.
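The following pandas sketch illustrates the three data vault structures. The sample values and some column names are made up for the illustration, loosely following the hub fields listed above; it is not the exact model from the source text.

import pandas as pd

# Hub: only the stable business keys, each with a surrogate key.
person_hub = pd.DataFrame({
    "IDPersonNumber": [1, 2],
    "PersonBusinessKey": ["P-1001", "P-1002"],
})

# Satellite: descriptive attributes that change over time, keyed to the hub.
person_satellite = pd.DataFrame({
    "IDPersonNumber": [1, 2],
    "FirstName": ["Asha", "Ravi"],
    "LastName": ["Rao", "Patil"],
    "LoadDate": pd.to_datetime(["2021-01-01", "2021-01-01"]),
})

# Link: a relationship between hubs, here Person-Location.
person_location_link = pd.DataFrame({
    "IDPersonNumber": [1, 2],
    "IDLocationNumber": [10, 20],
})

# Joining hub, satellite, and link reconstructs the business view.
view = (person_hub
        .merge(person_satellite, on="IDPersonNumber")
        .merge(person_location_link, on="IDPersonNumber"))
print(view)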
Figure 1.2
Data Science Processing Tools:
● These are the tools used to transform the data in the data lake into a data vault and then to transfer the data vault into the data warehouse.
● Most data scientists, data analysts, and data engineers use these data science processing tools to process the data and move it from the data vault into the data warehouse.
1. Spark:
● Apache Spark is an open source cluster computing framework. Open source means it is freely available: the source code can be downloaded from the internet and used as you wish.
● Apache Spark was developed at the AMPLab of the University of California, Berkeley, and was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable, and portable so that it runs on all platforms.
● Apache Spark provides an interface through which programmers and developers can interact directly with the system and run data-parallel work, which makes it well suited to data scientists and data engineers.
● Apache Spark has the capability to process all types and varieties of data from many repositories, including the Hadoop Distributed File System (HDFS) and NoSQL databases.
● Companies such as IBM hire many data scientists and data engineers who know the Apache Spark project well, so that innovation can happen easily and new features and changes keep arriving.
● Apache Spark can process data very fast because it holds the data in memory and uses an in-memory data processing engine.
● It can be built on top of the Hadoop Distributed File System, which makes data access more efficient and reliable and extends Hadoop MapReduce.
Figure 1.3
2. Spark Core:
● Spark Core is the base and foundation of the overall project; it provides distributed task dispatching, scheduling, and basic input and output functionality.
● Using Spark Core, you can run more complex queries that help you work in complex environments.
● The distributed nature of the Spark ecosystem lets you take the same processing that runs on a small cluster up to hundreds or thousands of nodes without making any changes.
● Apache Spark can use Hadoop in two ways: for storage and for processing.
● Spark is not a modified version of the Hadoop Distributed File System; it can depend on Hadoop for storage, but it has its own features and tools for data processing.
● Apache Spark has many features that make it compatible and reliable. Speed is one of the most important: applications running on Spark over Hadoop can be up to 100 times faster when processing in memory.
● Spark Core supports several languages and has built-in functions and APIs in Java, Scala, and Python, so you can write applications in Java, Python, Scala, and (through additional bindings) R.
● Spark Core offers advanced analytics: it supports not only map and reduce but also SQL queries, machine learning, and graph algorithms. A minimal PySpark sketch follows.
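A minimal PySpark sketch of Spark Core's data-parallel model, assuming the pyspark package is installed and a local Spark runtime is available.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-core-sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a small collection across the cluster and run a parallel map/reduce.
numbers = sc.parallelize(range(1, 1001))
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()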
Figure 1.4
3. Spark SQL:
● Spark SQL is a component on top of Spark Core that introduces a data abstraction called DataFrames.
● Spark SQL provides a fast, cluster-wide data abstraction, so that data manipulation can be done with fast computation.
● It enables the user to run SQL/HQL on top of Spark, and with it we can process structured, semi-structured, and unstructured data.
● Apache Spark SQL bridges the gap between relational databases and procedural processing. This matters when we want to load data from traditional systems into the data lake ecosystem.
● Spark SQL is Apache Spark's module for working with structured and semi-structured data, and it originated to overcome the limitations of Apache Hive.
● Hive depends on Hadoop's MapReduce engine for the execution and processing of data and allows only batch-oriented operation.
● Hive lags in performance because it relies on MapReduce jobs to execute ad hoc queries, and it does not allow you to resume a job if it fails in the middle.
● Spark performs better than Hive in many situations; Hive's latency can run into hours because of MapReduce CPU reservation time.
● You can integrate Spark SQL and query structured and semi-structured data inside Apache Spark.
● Spark SQL follows the RDD model, and it also supports large jobs and mid-query fault tolerance.
● You can easily connect Spark SQL through JDBC and ODBC for better business connectivity. A short DataFrame and SQL sketch follows.
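A small illustrative Spark SQL sketch (again assuming pyspark is installed): a DataFrame of made-up sales figures is registered as a temporary view and queried with plain SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# A small DataFrame standing in for structured data loaded from the data lake.
df = spark.createDataFrame(
    [("DE", 120), ("IN", 340), ("US", 210)],
    ["country", "sales"],
)
df.createOrReplaceTempView("sales")

# The same data can now be queried with SQL on top of Spark.
spark.sql("SELECT country, sales FROM sales WHERE sales > 200").show()

spark.stop()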
4. Spark Streaming:
● Apache Spark Streaming enables powerful, interactive data analytics applications over live streaming data. In streaming, the data is not fixed; it arrives continuously from different sources.
● The stream divides the incoming input data into small units of data for further data analytics and processing at the next level.
● There are multiple levels of processing involved: live streaming data is received and divided into small parts or batches, and these small batches are then processed by the Spark engine to generate the final stream of results.
● Data processing in plain Hadoop has very high latency, meaning results are not available in a timely manner, so it is not suitable for real-time processing requirements.
● Apache Storm processes each record as it arrives, but if a failure occurs a record may be processed again; this kind of failure and latency can lead to data loss or to records being processed repeatedly.
● In most scenarios, Hadoop is used for data batching, while Apache Spark is used for live streaming of data.
● Apache Spark Streaming helps to fix these kinds of issues and provides a reliable, portable, scalable, and efficient system that integrates well. A minimal Structured Streaming sketch follows.
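A minimal streaming sketch using Spark's newer Structured Streaming API (assuming pyspark is installed). The built-in rate source stands in for a live feed, so the example runs without any external system.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("spark-streaming-sketch").getOrCreate()

# The "rate" source generates rows continuously, standing in for live data.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Incoming micro-batches are grouped into 10-second windows and counted.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).agg(count("*"))

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)   # run for roughly 30 seconds for the demonstration
query.stop()
spark.stop()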
Figure 1.5
5. GraphX:
GraphX is a very powerful graph processing application programming interface (API) for the Apache Spark analytics engine.
● GraphX is a component of Spark for graphs and graph-parallel computation.
● GraphX follows the ETL process (Extract, Transform, and Load) and unifies exploratory analysis and iterative graph computation within a single system.
● Its usage can be seen in Facebook friend graphs, LinkedIn connections, Google Maps, and internet routers, which use these kinds of tools for better responses and analysis.
● A graph is an abstract data type used to implement the directed and undirected graph concepts from graph theory in mathematics.
● In graph theory, each data element (vertex) is associated with other elements through edges, which can carry values such as numeric weights.
● Every edge and node (vertex) can have user-defined properties and values associated with it.
Figure 1.6
● GraphX offers great flexibility for working with graphs and graph computation. It follows the ETL process (Extract, Transform, and Load) and unifies exploratory analysis and iterative graph computation within a single system.
● Speed is one of its most important points: it is comparable with the fastest specialized graph systems, while still providing fault tolerance and ease of use.
● It also provides a library of graph algorithms along with many other features that add flexibility and reliability. A small property-graph illustration in Python follows.
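GraphX itself is driven from Scala or Java on Spark, so as a plain illustration of the property-graph idea (vertices and edges carrying user-defined properties, plus an iterative algorithm such as PageRank), here is a small sketch using the Python networkx library rather than GraphX.

import networkx as nx

# Build a tiny directed property graph; node and edge attributes are user-defined.
g = nx.DiGraph()
g.add_node("alice", role="analyst")
g.add_node("bob", role="engineer")
g.add_edge("alice", "bob", relation="follows", weight=1.0)

# An iterative graph computation: PageRank over the directed graph.
print(nx.pagerank(g))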
6. Mesos:
● Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley.
● It provides all the required resource isolation and sharing across distributed applications.
● The Mesos software provides resource sharing in a fine-grained manner, so that cluster utilization can be improved.
● Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it is notable for running Kafka, Cassandra, Spark, and Akka.
● It can handle workloads in distributed environments using dynamic sharing and isolation.
● Apache Mesos can be deployed to manage large-scale, distributed cluster environments.
● Whatever resources are available on the existing machines or nodes are grouped together into a single cluster, so that the load can be optimized across it.
● Apache Mesos is, in a sense, the opposite of virtualization: in virtualization one physical resource is shared among multiple virtual resources, whereas in Mesos multiple physical resources are grouped together to form a single virtual machine.
Figure 1.7
7. Akka:
● Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes.
● Each actor can be controlled and limited to perform only its intended task. Akka is an open source library or toolkit.
● Akka is used to create distributed, fault-tolerant applications; the library can be integrated into the Java Virtual Machine (JVM) to support JVM languages.
● Akka is written in Scala and integrates with the Scala programming language, and it helps developers deal with explicit locking and thread management.
Figure 1.8
8. Cassandra:
● Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
● Cassandra can be used both as a real-time operational data store for online transactional applications and for read-intensive analytical workloads.
● Cassandra is designed as a peer-to-peer system of continuously connected nodes, with no master or named nodes, to ensure that there is no single point of failure.
● Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data, and it is a type of NoSQL database.
● A NoSQL database provides a mechanism for storing and retrieving data that differs from that of a relational database.
● A NoSQL database uses different data structures than a relational database and typically supports a simpler query language.
● A NoSQL database usually has no fixed schema and does not provide full relational transactions.
● Cassandra is used by some very large companies, such as Facebook, Twitter, Cisco, and Netflix.
● The main components of Cassandra are the node, data center, cluster, commit log, mem-table, SSTable, and bloom filter.
Figure 1.9
Figure 1.10
9. Kafka:
● Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
● Apache Kafka is a highly scalable, reliable, fast, distributed system. Kafka is suitable for both offline and online message consumption.
● Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
● Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it very reliable.
● The Kafka messaging system scales easily without downtime, which makes it very scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store terabytes of data.
● Kafka is a unique platform for handling real-time data feeds and can deliver large amounts of data to diverse consumers.
● Kafka persists all data to the disk, which essentially means that all writes go to the page cache of the OS (RAM). This makes it very efficient to transfer data from the page cache to a network socket. A small producer/consumer sketch in Python follows.
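A minimal publish/subscribe sketch using the kafka-python package, assuming a broker is running on localhost:9092; the topic name datalake-events is made up for the example.

from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("datalake-events", value=f"event-{i}".encode("utf-8"))
producer.flush()

# Read the messages back; Kafka keeps them on disk for other consumers as well.
consumer = KafkaConsumer(
    "datalake-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value.decode("utf-8"))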
Figure 1.11
Different programming languages and tools used in data science processing:
1. Elasticsearch:
● Elasticsearch is an open source, distributed search and analytics engine.
● It is designed for scalability (it can scale out as needed), reliability, and stress-free management.
● It combines the power of search with the power of analytics, so that developers, programmers, data engineers, and data scientists can work smoothly with structured, unstructured, and time series data.
● Elasticsearch is open source, which means anyone can download and work with it; it is developed in Java, and many big organizations use this search engine for their needs.
● It enables the user to search very large amounts of data at very high speed.
● It is sometimes used as a replacement for document and data stores such as MongoDB.
● Elasticsearch is one of the most popular search engines and is used by organizations such as Stack Overflow and GitHub.
● Elasticsearch is open source and was made available under the Apache License, Version 2.0. A short indexing and search sketch using the Python client follows.
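A small indexing-and-search sketch, assuming the official elasticsearch Python client (8.x API) and a node running on localhost:9200; the index name and document are made up.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document, refresh the index, and run a full-text search over it.
es.index(index="articles", id=1, document={"title": "Data Science", "views": 120})
es.indices.refresh(index="articles")
result = es.search(index="articles", query={"match": {"title": "science"}})
print(result["hits"]["total"])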
2. R Language:
● R is a programming language used for statistical computing and graphics.
● R is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
● Before learning R there are some core requirements: you should understand its library and package concepts and know how to work with them.
● Commonly used R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, and reshape.
● R is freely available under the GNU General Public License and supports many platforms, such as Windows, Linux/UNIX, and macOS.
● R has built-in capabilities to integrate with procedural code written in C, C++, Java, .NET, and Python.
● R has strong capacity and potential for handling data and data storage.
3. Scala:
● Scala is a general-purpose programming language that supports functional programming and a strong static type system.
● Many data science projects and frameworks are built using the Scala programming language because of its capabilities and potential.
● Scala integrates the features of object-oriented and functional programming languages, and it interoperates with other JVM languages such as Java.
● The types and behavior of objects are described by classes, and a class can be extended by another class, which inherits its properties.
● Scala supports higher-order functions: a function can be passed to and called by another function in code.
● Once a Scala program is compiled, it is converted into bytecode (a machine-understandable form) that is executed by the Java Virtual Machine.
● This means that Scala and Java programs can be compiled and executed on the same JVM, so it is easy to move from Java to Scala and vice versa.
● Scala lets you import and use existing Java classes, objects, and their behavior and functions, because Scala and Java both run on the Java Virtual Machine, and you can also create your own classes and objects.
● Instead of requiring thousands of lines of code, Scala reduces code in a way that keeps it readable, reliable, and portable, and helps developers and programmers type code in an easy way.
4. Python:
● Python is a programming language that can be used on a server to create web applications.
● Python can be used for web development, mathematics, and software development, and it is used to connect to databases and to create and modify data.
● Python can handle large amounts of data and is capable of performing complex tasks on that data.
● Python is reliable, portable, and flexible and works on different platforms such as Windows, macOS, and Linux.
● Compared with other programming languages, Python is easy to learn; it can perform simple as well as complex tasks, and it reduces the lines of code needed, which helps programmers and developers work in an easy, friendly manner.
● Python supports object-oriented, functional, and structured programming and works well with structured data.
● Python supports dynamic data types, backed by dynamic type checking.
● Python is interpreted, and its philosophy and statements aim to reduce the lines of code.
SUMMARY
This chapter helps you to recognize the basics of data science tools and their influence on modern data lake development. You will discover the techniques for transforming a data vault into a data warehouse bus matrix. It explains the use of Spark, Mesos, Akka, Cassandra, and Kafka to tame your data science requirements, and it guides you in the use of Elasticsearch and MQTT (MQ Telemetry Transport) to enhance your data science solutions. It helps you to recognize the influence of R as a creative visualization solution and introduces the impact of programming languages such as R, Python, and Scala on the data science ecosystem.
REFERENCES
● Andreas François Vermeulen, Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, Apress, 2018.
● Sinan Ozdemir, Principles of Data Science, Packt, 2016.
● Joel Grus, Data Science from Scratch, O'Reilly, 2015.
● Joel Grus, Data Science from Scratch: First Principles with Python, Shroff Publishers, 2017.
● N. C. Das, Experimental Design in Data Science with Least Resources, Shroff Publishers, 2018.
*****
2
VERMEULEN-KRENNWALLNER-HILLMAN-CLARK
● Vermeulen-Krennwallner-Hillman-Clark (VKHCG) is a small international group of companies, and it consists of four subsidiaries: 1. Vermeulen PLC, 2. Krennwallner AG, 3. Hillman Ltd, and 4. Clark Ltd.
1. Vermeulen PLC:
● Vermeulen PLC is the data processing company that processes all the data within the group's companies.
● This is the company for which most of the group's data engineers and data scientists work.
● The company supplies data science tools, networks, servers and communication systems, internal and external web sites, decision science, and process automation.
2. Krennwallner AG:
● This is the advertising and media company that prepares the advertising and media information required for customers.
● Krennwallner supplies advertising on billboards, advertising and content management for online delivery, and similar services.
● Using the records and data available on the internet about media streams, it analyzes which media streams are watched by customers, how many times, and which content on the internet is most watched.
● Using surveys, it specifies and chooses content for the billboards and works out how many times customers visited each channel.
3. Hillman Ltd:
● This is the logistics and supply chain company, supplying logistics services around the world for the group's business.
● This includes client warehousing, international shipping, and home-to-home logistics.
4. Clark Ltd:
● This is the financial company that processes all the financial data required for financial purposes, including financial support, venture capital planning, and investing money in the share market.
Scala:
● Scala is a general-purpose programming language that supports functional programming and a strong static type system.
● Many data science projects and frameworks are built using the Scala programming language because of its capabilities and potential.
● Scala integrates the features of object-oriented and functional programming languages, and it interoperates with other JVM languages such as Java.
● The types and behavior of objects are described by classes, and a class can be extended by another class, which inherits its properties.
● Scala supports higher-order functions: a function can be passed to and called by another function in code.
Apache Spark:
● Apache Spark is an open source cluster computing framework. Open source means it is freely available: the source code can be downloaded from the internet and used as you wish.
● Apache Spark was developed at the AMPLab of the University of California, Berkeley, and was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable, and portable so that it runs on all platforms.
● Apache Spark provides an interface through which programmers and developers can interact directly with the system and run data-parallel work, which makes it well suited to data scientists and data engineers.
● Apache Spark has the capability to process all types and varieties of data from many repositories, including the Hadoop Distributed File System and NoSQL databases.
● Companies such as IBM hire many data scientists and data engineers who know the Apache Spark project well, so that innovation can happen easily and new features and changes keep arriving.
Apache Mesos:
● Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley.
● It provides all the required resource isolation and sharing across distributed applications.
● The Mesos software provides resource sharing in a fine-grained manner, so that utilization of the cluster can be improved.
● Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it is notable for running Kafka, Cassandra, Spark, and Akka.
Akka:
● Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes.
● Each actor can be controlled and limited to perform only its intended task. Akka is an open source library or toolkit.
● Akka is used to create distributed, fault-tolerant applications; the library can be integrated into the Java Virtual Machine (JVM) to support JVM languages.
● Akka is written in Scala and integrates with the Scala programming language, and it helps developers deal with explicit locking and thread management.
Apache Cassandra:
● Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
● Cassandra can be used both as a real-time operational data store for online transactional applications and for read-intensive analytical workloads.
● Cassandra is designed as a peer-to-peer system of continuously connected nodes, with no master or named nodes, to ensure that there is no single point of failure.
● Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data, and it is a type of NoSQL database.
● A NoSQL database provides a mechanism for storing and retrieving data that differs from that of a relational database.
Kafka:
● Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
● Apache Kafka is a highly scalable, reliable, fast, distributed system. Kafka is suitable for both offline and online message consumption.
● Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
● Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it very reliable.
● The Kafka messaging system scales easily without downtime, which makes it very scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store terabytes of data.
Python:
● Python is a programming language that can be used on a server to create web applications.
● Python can be used for web development, mathematics, and software development, and it is used to connect to databases and to create and modify data.
● Python can handle large amounts of data and is capable of performing complex tasks on that data.
● Python is reliable, portable, and flexible and works on different platforms such as Windows, macOS, and Linux.
● Python can be installed on all of these operating systems (Windows, Linux, and macOS) and works on all of them for learning and development purposes.
You can gain much more knowledge by installing and working on all three platforms for data science and data engineering.
● To install the packages required for data science in Python on Ubuntu, run the following command:
● sudo apt-get install python3 python3-pip python3-setuptools
● To install the packages required for data science in Python on Red Hat-based Linux distributions (for example, CentOS or Fedora), run the following command:
● sudo yum install python3 python3-pip python3-setuptools
● To install Python and its data science packages on Windows, download Python from:
● https://www.python.org/downloads/
● Python Libraries:
● A Python library is a collection of functions and methods that allows you to perform many actions without writing your own code.
Pandas:
● Pandas stands for "panel data" and is the core library for data manipulation and data analysis.
● It provides both one-dimensional and multi-dimensional data structures for data analysis.
● To install pandas on Ubuntu, use the following command:
● sudo apt-get install python-pandas
● To install pandas on Red Hat-based Linux distributions, use the following command:
● yum install python-pandas
● To install pandas on Windows, use the following command:
● pip install pandas
A short pandas example follows.
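A short pandas example with a made-up table, showing the kind of manipulation and analysis the library is used for.

import pandas as pd

# A small tabular data set for quick manipulation and summary statistics.
df = pd.DataFrame({
    "Country": ["IN", "DE", "IN", "US"],
    "Sales":   [120,  90,   150,  200],
})
print(df.groupby("Country")["Sales"].sum())
print(df.describe())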
Matplotlib:
● Matplotlib is used for data visualization and is one of the most important packages of Python.
● Matplotlib is written in Python and is used to display and visualize 2D data.
● It can be used from Python scripts, Jupyter notebooks, and web application servers.
● To install the Matplotlib library on Ubuntu, use the following command:
● sudo apt-get install python-matplotlib
● To install the Matplotlib library on Red Hat-based Linux distributions, use the following command:
● sudo yum install python-matplotlib
● To install the Matplotlib library on Windows, use the following command:
● pip install matplotlib
A small plotting example follows.
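A small Matplotlib example that draws a simple 2D line plot and writes it to a file (the file name is arbitrary).

import matplotlib.pyplot as plt

# Plot a small series and save the figure instead of opening a window.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y, marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A first Matplotlib plot")
plt.savefig("first_plot.png")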
NumPy:
● NumPy is the fundamental package for numerical computing in Python.
● NumPy is used together with the SciPy and Matplotlib packages and is freely available. A short example follows.
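A short NumPy example showing vectorized numerical work on an array.

import numpy as np

# Element-wise operations and summary statistics without explicit loops.
values = np.array([2.0, 4.0, 6.0, 8.0])
print(values.mean(), values.std())
print(np.sqrt(values))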
SymPy:
● SymPy is a Python library for symbolic mathematics that can work with complex algebraic formulas. A short example follows.
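A short SymPy example: expanding and solving a symbolic expression.

from sympy import symbols, solve, expand

x = symbols("x")
print(expand((x + 2)**2))   # x**2 + 4*x + 4
print(solve(x**2 - 4, x))   # [-2, 2]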
R:
● R is a programming language used for statistical computing and graphics.
● R is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
● Before learning R there are some core requirements: you should understand its library and package concepts and know how to work with them.
● Commonly used R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, and reshape.
SUMMARY
References
*****
Unit II
3
THREE MANAGEMENT LAYERS
Unit Structure
3.0 Objectives
3.1 Introduction
3.2 Operational Management Layer
3.2.1 Definition and Management of Data Processing stream
3.2.2 Eco system Parameters
3.2.3 Overall Process Scheduling
3.2.4 Overall Process Monitoring
3.2.5 Overall Communication
3.2.6 Overall Alerting
3.3 Audit, Balance, and Control Layer
3.4 Yoke Solution
3.5 Functional Layer
3.6 Data Science Process
3.7 Unit End Questions
3.8 References
3.0 OBJECTIVES
3.1 INTRODUCTION
3.2 OPERATIONAL MANAGEMENT LAYER
● Operations management is one of the areas inside the ecosystem
responsible for designing and controlling the process chains of a data
science environment.
● This layer is the center for complete processing capability in the data
science ecosystem.
● This layer stores what you want to process along with every
processing schedule and workflow for the entire ecosystem.
● Example: an ecosystem setup phase
▪ The “rope” is attached to all the processes from the beginning to the end of the pipeline; this makes sure that no processing is done that is not attached to the drum.
● This approach ensures that all the processes in the pipeline complete more efficiently, as no process enters or leaves the pipeline without being recorded by the drum's beat.
Process Monitoring:
● The central monitoring process makes sure that there is a single
unified view of the complete system.
● We should always ensure that the monitoring of our data science is
being done from a single point.
● Without central monitoring, running different data science processes on the same ecosystem makes management a difficult task.
Overall Communication:
● The operations management layer handles all communication from the system; it makes sure that any activities that are happening are communicated to the system.
● To make sure that all our data science processes are tracked, we may use a complex communication process.
3.3.1 Audit:
● An audit refers to a systematic and independent examination of the ecosystem.
● This sublayer records which processes are running at any given point within the ecosystem.
● Data scientists and engineers use the information collected here to better understand and plan future improvements to the processing.
● The audit in the data science ecosystem consists of a series of observers that record prespecified processing indicators related to the ecosystem.
Built-in Logging:
● It is always a good idea to design logging into an organized, prespecified location; this ensures that every relevant log entry is captured in one place.
● Changing the internal or built-in logging process of the data science tools should be avoided, as this makes any future upgrade complex and will prove very costly to correct.
● A built-in logging mechanism, along with a cause-and-effect analysis system, allows you to handle more than 95% of all issues that can arise in the ecosystem.
● Since there are five logging levels, it is good practice to have five watchers, one per logging location and independent of each other, as described below (a Python logging sketch follows the descriptions):
Debug Watcher:
● This level of logging is the most verbose logging level.
● If any debug logs are discovered in the ecosystem, an alarm should be raised, indicating that the tool is spending precious processing cycles performing low-level debugging.
Information Watcher:
● The information watcher logs information that is beneficial to the running and management of a system.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Warning Watcher:
● Warning is usually used for exceptions that are handled or other
important log events.
● Usually this means that the tool handled the issue and took corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Error Watcher:
● An Error logs all unhandled exceptions in the data science tool.
● An Error is a state of the system. This state is not good for the overall
processing, since it normally means that a specific step did not
complete as expected.
● In case of an error the ecosystem should handle the issue and take the
necessary corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Fatal Watcher:
● Fatal is a state reserved for special exceptions or conditions for which
it is mandatory that the event causing this state be identified
immediately.
● This state is not good for the overall processing, since it normally
means that a specific step did not complete as expected.
● In case of a fatal error, the ecosystem should handle the issue and take the necessary corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
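As a sketch of how the five watcher levels can be wired up, Python's standard logging module maps onto them directly (CRITICAL standing in for the fatal level); the log file name here is only a stand-in for the central Audit, Balance, and Control data store.

import logging

# One logger writing to a single, prespecified location.
logging.basicConfig(
    filename="abc_store.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("ecosystem")

log.debug("Low-level debugging detail")           # Debug watcher
log.info("Retrieve step started")                 # Information watcher
log.warning("Handled exception, recovered")       # Warning watcher
log.error("Step did not complete as expected")    # Error watcher
log.critical("Fatal condition, investigate now")  # Fatal watcher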
Process Tracking:
● For process tracking it is advised to create a tool that performs a controlled, systematic, and independent examination of the process for hardware logging.
● There is plenty of server-based software that monitors hardware-related parameters such as voltage, fan speeds, temperature sensors, and clock speeds of a computer system.
● It is advised to use the tool that both you and your customer are most comfortable working with.
● It is also advised that the logs generated be fed into the cause-and-effect analysis system.
Data Provenance:
● For every data entity, all the transformations in the system should be tracked so that a record can be generated for the activity.
● This ensures two things: 1. that we can reproduce the data, if required, in the future, and 2. that we can supply a detailed history of the data's source in the system throughout its transformations.
Data Lineage:
● This involves keeping a record of every change, whenever it happens, to every individual data value in the data lake.
● This helps us to figure out the exact value of any data item at any point in the past.
● This is normally accomplished by enforcing a valid-from and valid-to audit entry for every data item in the data lake.
3.3.2 Balance:
● The balance sublayer has the responsibility to make sure that the data science environment is balanced between the available processing capability and the required processing capability, or that it has the ability to upgrade processing capability during periods of extreme processing load.
● In such cases, the on-demand processing capability of a cloud environment becomes highly desirable.
3.3.3 Control:
● The execution of the current active data science processes is controlled
by the control sublayer.
● The control elements of the control sublayer are a combination of:
● the control element available in the Data Science Technology
Stack’s tools and
● a custom interface to control the overarching work.
● When a processing pipeline encounters an error, the control sublayer attempts a recovery as per our prespecified requirements; if recovery does not work, it schedules a cleanup utility to undo the error.
● The cause-and-effect analysis system is the core data source for the
distributed control system in the ecosystem.
3.4 YOKE SOLUTION
3.4.1 Producer:
● The producer is the part of the system that generates the requests for data science processing, by creating structured messages for each type of data science process it requires.
● The producer is the end point of the pipeline that loads messages into Kafka.
3.4.2 Consumer:
● The consumer is the part of the process that takes in messages and
organizes them for processing by the data science tools.
● The consumer is the end point of the pipeline that offloads the
messages from Kafka.
● Programming and modeling: any data science project must have processing elements in this layer.
● The processing algorithms and data models are spread across six
supersteps for processing the data lake.
1. Retrieve: This super step contains all the processing chains for
retrieving data from the raw data lake into a more structured
format.
2. Assess: This super step contains all the processing chains for quality assurance and additional data enhancements.
3. Process: This super step contains all the processing chains for
building the data vault.
4. Transform: This super step contains all the processing chains for
building the data warehouse from the core data vault.
5. Organize: This super step contains all the processing chains for
building the data marts from the core data warehouse.
6. Report: This super step contains all the processing chains for
building virtualization and reporting of the actionable knowledge.
REFERENCES
Andreas François Vermeulen, “Practical Data Science - A Guide to
Building the Technology Stack for Turning Data Lakes into Business
Assets”
*****
4
RETRIEVE SUPER STEP
Unit Structure
4.0 Objectives
4.1 Introduction
4.2 Data Lakes
4.3 Data Swamps
4.3.1 Start with Concrete Business Questions
4.3.2 Data Quality
4.3.3 Audit and Version Management
4.3.4 Data Governance
4.3.4.1 Data Source Catalog
4.3.4.2 Business Glossary
4.3.4.3 Analytical Model Usage
4.4 Training the Trainer Model
4.5 Shipping Terminologies
4.5.1 Shipping Terms
4.5.2 Incoterm 2010
4.6 Other Data Sources /Stores
4.7 Review Questions
4.8 References
4.0 OBJECTIVES
4.1 INTRODUCTION
● The Retrieve super step is a practical method for importing a data lake
consisting of different external data sources completely into the
processing ecosystem.
● The Retrieve super step is the first contact between your data science
and the source systems.
● The successful retrieval of the data is a major stepping-stone to
ensuring that you are performing good data science.
● Data lineage delivers the audit trail of the data elements at the lowest
granular level, to ensure full data governance.
● Data governance supports metadata management for system
guidelines, processing strategies, policies formulation, and
implementation of processing.
● Data quality and master data management helps to enrich the data
lineage with more business values, if you provide complete data
source metadata.
● The Retrieve super step supports the edge of the ecosystem, where
your data science makes direct contact with the outside data world. I
will recommend a current set of data structures that you can use to
handle the deluge of data you will need to process to uncover critical
business knowledge.
● A company’s data lake covers all data that your business is authorized
to process, to attain an improved profitability of your business’s core
accomplishments.
● The data lake is the complete data world your company interacts with
during its business life span.
● In simple terms, if you generate data or consume data to perform your
business tasks, that data is in your company’s data lake.
● Just as a lake needs rivers and streams to feed it, the data lake will consume an unavoidable deluge of data sources from upstream and deliver them to downstream partners.
● Data swamps are simply data lakes that are not managed.
● They are not to be feared. They need to be tamed.
● Following are four critical steps to avoid a data swamp.
1. Start with Concrete Business Questions
2. Data Quality
3. Audit and Version Management
4. Data Governance
4.3.1 Start with Concrete Business Questions:
● Simply dumping a horde of data into a data lake, with no tangible
purpose in mind, will result in a big business risk.
● The data lake must be enabled to collect the data required to answer
your business questions.
● It is suggested to perform a comprehensive analysis of the entire set of
data you have and then apply a metadata classification for the data,
stating full data lineage for allowing it into the data lake.
• Example: ISO 3166 defines country codes as per United Nations sources.
● Long description (it should be kept as complete as possible)
• Country codes and country names used by your organization as the standard for country entries.
● Contact information for the external data source
• ISO 3166-1:2013 code lists from www.iso.org/iso-3166-country-codes.html
● Expected frequency
• Irregular, i.e., no fixed frequency, also known as ad hoc.
• Other options are near-real-time, every 5 seconds, every minute, hourly, daily, weekly, monthly, or yearly.
4.3.4.3 Analytical Model Usage:
● Data tagged in respective analytical models define the profile of the
data that requires loading and guides the data scientist to what
additional processing is required.
● The following data analytical models should be executed on every data
set in the data lake by default.
INPUT_DATA_with_ID = Row_ID_to_column(INPUT_DATA_FIX, var = "Row_ID")
● Data Type of Each Data Column
• Determine the best data type for each column, to assist you in
completing the business glossary, to ensure that you record the
correct import processing rules.
• Example: To find datatype of each column
sapply(INPUT_DATA_with_ID, typeof)
● Minimum Value
• Determine the minimum value in a specific column.
• Example: find minimum value
min(country_histogram$Country)
or
sapply(country_histogram[,'Country'], min, na.rm=TRUE)
● Maximum Value
• Determine the maximum value in a specific column.
• Example: find maximum value
max(country_histogram$Country)
or
sapply(country_histogram[,'Country'], max, na.rm=TRUE)
● Mean
• If the column is numeric in nature, determine the average value in a
specific column.
• Example: find mean of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], mean,
na.rm=TRUE)
● Median
• Determine the value that splits the data set into two parts in a
specific column.
• Example: find median of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], median,
na.rm=TRUE)
● Mode
• Determine the value that appears most in a specific column.
• Example: Find mode for column country
INPUT_DATA_COUNTRY_FREQ=data.table(with(INPU
T_DATA_with_ID, table(Country)))
● Range
• For numeric values, you determine the range of the values by taking
the maximum value and subtracting the minimum value.
• Example: find range of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], range, na.rm=TRUE)
● Quartiles
• These are the base values that divide a data set in quarters. This is
done by sorting the data column first and then splitting it in groups
of four equal parts.
• Example: find quartile of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], quantile,
na.rm=TRUE)
● Standard Deviation
• The standard deviation is a measure of the amount of variation or
dispersion of a set of values.
• Example: find standard deviation of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], sd,
na.rm=TRUE)
● Skewness
• Skewness describes the shape or profile of the distribution of the
data in the column.
• Example: find skewness of latitude
library(e1071)
skewness(lattitue_histogram_with_id$Latitude, na.rm =
FALSE, type = 2)
● Data Pattern
• The following process has been used for years to determine a pattern of the data values themselves.
• Here is the standard version: replace every letter with an uppercase A, every digit with an uppercase N, every space with a lowercase b, and any other unknown character with a lowercase u.
• As a result, “Data Science 102” becomes “AAAAbAAAAAAAbNNN”. This pattern creation is beneficial for designing any specific assess rules. A small Python helper implementing this rule follows.
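A small Python helper implementing the pattern rule described above; the function name is arbitrary.

import re

def data_pattern(value: str) -> str:
    # Letters -> A, digits -> N, spaces -> b, anything else -> u.
    value = re.sub("[A-Za-z]", "A", value)
    value = re.sub("[0-9]", "N", value)
    value = re.sub(" ", "b", value)
    return re.sub("[^ANb]", "u", value)

print(data_pattern("Data Science 102"))   # AAAAbAAAAAAAbNNN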
● To prevent a data swamp, it is essential that you train your team also.
Data science is a team effort.
● People, process, and technology are the three cornerstones to ensure
that data is curated and protected.
● You are responsible for your people; share the knowledge you acquire
from this book. The process I teach you, you need to teach them.
Alone, you cannot achieve success.
● Technology requires that you invest time to understand it fully. We are
only at the dawn of major developments in the field of data
engineering and data science.
● Remember: A big part of this process is to ensure that business users
and data scientists understand the need to start small, have concrete
questions in mind, and realize that there is work to do with all data to
achieve success.
In this section, we discuss two things: shipping terms and Incoterm 2010.
● These determine the rules of the shipment, the conditions under which
it is made. Normally, these are stated on the shipping manifest.
● Following are the terms used:
• Seller - The person/company sending the products on the shipping
manifest is the seller. This is not a location but a legal entity
sending the products.
• Carrier - The person/company that physically carries the products
on the shipping manifest is the carrier. Note that this is not a
location but a legal entity transporting the products.
• Port - A Port is any point from which you have to exit or enter a
country. Normally, these are shipping ports or airports but can also
include border crossings via road. Note that there are two ports in
the complete process. This is important. There is a port of exit and
a port of entry.
• Ship - Ship is the general term for the physical transport method
used for the goods. This can refer to a cargo ship, airplane, truck,
or even person, but it must be identified by a unique allocation
number.
• Terminal - A terminal is the physical point at which the goods are
handed off for the next phase of the physical shipping.
• Named Place - This is the location where the ownership is legally
changed from seller to buyer. This is a specific location in the
overall process. Remember this point, as it causes many legal
disputes in the logistics industry.
• Buyer - The person/company receiving the products on the
shipping manifest is the buyer. In our case, there will be
warehouses, shops, and customers. Note that this is not a location
but a legal entity receiving the products.
● DAT—Delivered at a Terminal
• According to this term the seller has to deliver and unload the
goods at a named terminal. The seller assumes all risks till the
delivery at the destination and has to pay all incurred costs of
transport including export fees, carriage, unloading from the main
carrier at destination port, and destination port charges.
• The terminal can be a port, airport, or inland freight interchange,
but it must be a facility with the capability to receive the shipment.
If the seller is not able to organize unloading, it should consider
shipping under DAP terms instead. All charges after unloading (for
example, import duty, taxes, customs and on-carriage costs) are to
be borne by buyer.
• The data science version: if I were to buy an item at an overseas store and then pick it up at a local store before taking it home, and the overseas shop shipped it Delivered at Terminal (local shop), then the moment I pay at the register the ownership is transferred to me.
• However, if anything happens to the item between the payment and the pickup, the local shop pays. Once it has been picked up at the local shop, I have to pay if anything happens to it. So, the moment I take it, the transaction becomes EXW, and I have to pay any import duties on arrival in my home country.
● DAP - Delivered at Place
• Under this option the seller delivers the goods to a named place of destination. Here, the risk passes from seller to buyer at the destination point.
• Packaging costs at the origin have to be paid by the seller, and all the legal formalities in the exporting country are carried out by the seller at his own expense.
• Once the goods are delivered in the destination country, the buyer has to pay for the customs clearance.
• Here is the data science version. If I were to buy 100 pieces of a particular item from an overseas web site and then pick up the copies at a local store before taking them home, and the shop shipped the copies DAP - Delivered at Place (local shop), then the moment I paid at the register the ownership would be transferred to me. However, if anything happened to the items between the payment and the pickup, the web site owner pays. Once the 100 pieces are picked up at the local shop, I have to pay to unpack them at the store. So, the moment I take the copies, the transaction becomes EXW, and I have to pay any costs after I take them.
● DDP—Delivered Duty Paid
• Here the seller is responsible for the delivery of the products or
goods to an agreed destination place in the country of the buyer.
The seller has to pay for all expenses like packing at origin,
delivering the goods to the destination, import duties and taxes,
clearing customs etc.
• The seller is not responsible for unloading. This term DDP will
place the minimum obligations on the buyer and maximum
obligations on the seller. Neither the risk nor responsibility is
transferred to the buyer until delivery of the goods is completed at
the named place of destination.
• Here is the data science version. If I were to buy an item in
quantity 100 at an overseas web site and then pick them up at a
local store before taking them home, and the shop shipped DDP—
Delivered Duty Paid (my home)—the moment I pay at the till, the
ownership is transferred to me. However, if anything were to
happen to the items between the payment and the delivery at my
house, the store must replace the items as the term covers the
delivery to my house.
● While performing data retrieval you may have to work with one of the following data stores.
● SQLite
• This requires the sqlite3 package, which ships with Python's standard library; a minimal sketch follows.
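A minimal sqlite3 sketch that creates a local database file (the file and table names are made up), loads one row, and reads it back.

import sqlite3

conn = sqlite3.connect("vermeulen.db")
conn.execute("CREATE TABLE IF NOT EXISTS visits (country TEXT, hits INTEGER)")
conn.execute("INSERT INTO visits VALUES (?, ?)", ("IN", 42))
conn.commit()

# Retrieve the stored rows for further processing.
for row in conn.execute("SELECT country, hits FROM visits"):
    print(row)
conn.close()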
● Microsoft SQL Server
• Microsoft SQL server is common in companies, and this connector
supports your connection to the database. Via the direct
connection, use
from sqlalchemy import create_engine
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/folder')
● Oracle
• Oracle is a common database storage option in bigger companies.
It enables you to load data from the following data source with
ease:
from sqlalchemy import create_engine
engine = create_engine('oracle://andre:[email protected]:1521/vermeulen')
● MySQL
• MySQL is widely used by lots of companies for storing data. This
opens that data to your data science with the change of a simple
connection string.
• There are two options. For direct connect to the database, use
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/vermeulen')
● Apache Cassandra
• Cassandra is becoming a widely used distributed database engine in the corporate world.
• To access it, use the Python package cassandra-driver (imported as cassandra).
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('vermeulen')
● Apache Hadoop
• Hadoop is one of the most successful data lake ecosystems in
highly distributed data Science.
• The pydoop package includes a Python MapReduce and HDFS
API for Hadoop.
● Pydoop
• Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.
● Microsoft Excel
• Excel is common in the data sharing ecosystem, and it enables you
to load files using this format with ease.
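• A minimal sketch of loading a workbook with pandas (the file name is only an example, and an Excel reader such as openpyxl must be installed):
import pandas as pd
# Read the first sheet of an example workbook into a DataFrame
df = pd.read_excel('vermeulen_sales.xlsx', sheet_name=0)
print(df.head())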
● Apache Spark
• Apache Spark is now becoming the next standard for distributed
data processing. The universal acceptance and support of the
processing ecosystem is starting to turn mastery of this technology
into a must-have skill.
● Apache Hive
• Access to Hive opens its highly distributed ecosystem for use by
data scientists.
● Luigi
• Luigi enables a series of Python features that enable you to build
complex pipelines into batch jobs. It handles dependency
resolution and workflow management as part of the package.
• This will save you from performing complex programming while
enabling good quality processing
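• A minimal sketch of a Luigi task (the task and file names are illustrative):
import luigi

class WriteGreeting(luigi.Task):
    # The target this task produces; Luigi uses it for dependency resolution
    def output(self):
        return luigi.LocalTarget('greeting.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('hello data science')

if __name__ == '__main__':
    # Run the task with the in-process scheduler
    luigi.build([WriteGreeting()], local_scheduler=True)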
● Amazon S3 Storage
• S3, or Amazon Simple Storage Service (Amazon S3), creates
simple and practical methods to collect, store, and analyze data,
irrespective of format, completely at massive scale. I store most of
my base data in S3, as it is cheaper than most other methods.
• Package s3 - Python's s3 module connects to Amazon's S3 REST API.
• Package boto - The boto package is another useful tool that connects to Amazon's S3 REST API.
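• As one possible sketch, the boto3 package (the current successor to boto) can list buckets and download an object; the bucket and key names here are hypothetical, and AWS credentials are assumed to be configured:
import boto3

s3 = boto3.client('s3')
# Print the names of all buckets visible to the configured credentials
for bucket in s3.list_buckets().get('Buckets', []):
    print(bucket['Name'])
# Download one example object to the local disk
s3.download_file('my-example-bucket', 'raw/data.csv', 'data.csv')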
● Amazon Redshift
• Amazon Redshift is a cloud service that provides a fully managed, petabyte-scale data warehouse.
• The Python package redshift-sqlalchemy is an Amazon Redshift dialect for sqlalchemy that opens this data source to your data science.
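• A minimal sketch using the sqlalchemy-redshift dialect (the cluster endpoint and credentials are placeholders):
from sqlalchemy import create_engine, text

engine = create_engine(
    'redshift+psycopg2://scott:tiger@examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/vermeulen')
with engine.connect() as conn:
    # Simple round trip to confirm the connection works
    print(conn.execute(text('SELECT 1')).fetchone())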
8. List and explain the different data stores used in data science.
REFERENCES
Books:
● Andreas François Vermeulen, “Practical Data Science - A Guide to
Building the Technology Stack for Turning Data Lakes into Business
Assets”
Websites:
● https://www.aitworldwide.com/incoterms
● Incoterm: https://www.ntrpco.com/what-is-incoterms-part2/
*****
47
Unit III
5
ASSESS SUPERSTEP
Unit Structure
5.0 Objectives
5.1 Assess Superstep
5.2 Errors
5.2.1 Accept the Error
5.2.2 Reject the Error
5.2.3 Correct the Error
5.2.4 Create a Default Value
5.3 Analysis of Data
5.3.1 Completeness
5.3.2 Consistency
5.3.3 Timeliness
5.3.4 Conformity
5.3.5 Accuracy
5.3.6 Integrity
5.4 Practical Actions
5.4.1 Missing Values in Pandas
5.4.1.1 Drop the Columns Where All Elements Are Missing
Values
5.4.1.2 Drop the Columns Where Any of the Elements Is
Missing Values
5.4.1.3 Keep Only the Rows That Contain a Maximum of Two
Missing Values
5.4.1.4 Fill All Missing Values with the Mean, Median, Mode,
Minimum, and Maximum of the Particular Numeric
Column
5.5 Let us Sum up
5.6 Unit End Questions
5.7 List of References
5.0 OBJECTIVES
This chapter makes you understand the following concepts:
● Dealing with errors in data
● Principles of data analysis
● Different ways to correct errors in data
5.2 ERRORS
Errors are the norm, not the exception, when working with data.
By now, you’ve probably heard the statistic that 88% of spreadsheets
contain errors. Since we cannot safely assume that any of the data we
work with is error-free, our mission should be to find and tackle errors in
the most efficient way possible.
User Device OS Transactions
A Mobile Android 5
B Mobile Window 3
C Tablet NA 4
D NA Android 1
E Mobile IOS 2
Table 5.1
In the above case, the entire observations for User C and User D will be ignored under listwise deletion.
b. Pairwise: In this case, only the missing observations are ignored and the analysis is carried out on the variables that are present. In the above case, two separate samples will be analyzed: one with the combination of User, Device and Transactions, and the other with the combination of User, OS and Transactions. In such a case, no observation is deleted entirely; each of the samples simply ignores the variable that has the missing value in it.
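A minimal pandas sketch of the listwise and pairwise approaches on the data of Table 5.1:
import pandas as pd
import numpy as np

df = pd.DataFrame({'User': ['A', 'B', 'C', 'D', 'E'],
                   'Device': ['Mobile', 'Mobile', 'Tablet', np.nan, 'Mobile'],
                   'OS': ['Android', 'Window', np.nan, 'Android', 'IOS'],
                   'Transactions': [5, 3, 4, 1, 2]})
# Listwise deletion: the rows for User C and User D are dropped entirely
listwise = df.dropna()
# Pairwise handling: each analysis keeps the rows complete for the columns it needs
device_sample = df[['User', 'Device', 'Transactions']].dropna()
os_sample = df[['User', 'OS', 'Transactions']].dropna()
print(listwise)
print(device_sample)
print(os_sample)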
Use reject the error option if you can afford to lose a bit of data.
This is an option to be used only if the number of missing values is 2% of
the whole dataset or less.
5.2.3 Correct the Error:
Identify the Different Error Types:
We are going to look at a few different types of errors. Let’s take
the example of a sample of people described by a number of different
variables:
Table 5.2
Can you point out a few inconsistencies? Write a few of them down and
check your answers below!
1. First, there are empty cells for the "country" and "date of birth
variables". We call these missing attributes.
2. If you look at the "Country" column, you see a cell that contains 24.
“24” is definitely not a country! This is known as a lexical error.
3. Next, you may notice in the "Height" column that there is an entry
with a different unit of measure. Indeed, Rodney's height is recorded in
feet and inches while the rest are recorded in meters. This is an
irregularity error because the units of measure are not uniform.
4. Mark has two email addresses. It is not necessarily a problem, but if
you forget about this and code an analysis program based on the
assumption that each person has only one email address, your program
will probably crash! This is called a formatting error.
5. Look at the "date of birth" variable. There is also a formatting error
here as Rob’s date of birth is not recorded in the same format as
the others.
6. Samuel appears on two different rows. But, how can we be sure this is
the same Samuel? By his email address, of course! This is called a
duplication error. But look closer, Samuel’s two rows each give a
different value for the "height variable": 1.67m and 1.45m. This is
called a contradiction error.
7. Honey is apparently 9'1". This height diverges greatly from the normal
heights of human beings. This value is, therefore, referred to as an outlier.
Figure 5.1
One of the causes of data quality issues is in source data that is
housed in a patchwork of operational systems and enterprise applications.
Each of these data sources can have scattered or misplaced values,
outdated and duplicate records, and inconsistent (or undefined) data
standards and formats across customers, products, transactions, financials
and more.
5.3.1 Completeness:
For example, a customer’s first name and last name are mandatory
but middle name is optional; so a record can be considered complete even
if a middle name is not available.
5.3.2 Consistency:
Examples:
● A business unit status is closed but there are sales for that business
unit.
● Employee status is terminated but pay status is active.
Questions you can ask yourself: Are data values the same across the data
sets? Are there any distinct occurrences of the same data instances that
provide conflicting information?
53
5.3.3 Timeliness:
5.3.4 Conformity:
5.3.5 Accuracy:
Accuracy is the degree to which data correctly reflects the real world
object OR an event being described. Examples:
● Sales of the business unit are the real value.
● Address of an employee in the employee database is the real address.
5.3.6 Integrity:
Ask yourself: Is any data missing important relationship linkages? The
inability to link related records together may actually introduce duplication
across your systems.
5.4.1.1. Drop the Columns Where All Elements Are Missing Values
Importing data:
55
Syntax: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
● axis: axis takes int or string value for rows/columns. Input can be 0 or
1 for Integer and ‘index’ or ‘columns’ for String.
● how: how takes string value of two kinds only (‘any’ or ‘all’). ‘any’
drops the row/column if ANY value is Null and ‘all’ drops only if
ALL values are null.
● thresh: thresh takes an integer value that specifies the minimum number of non-NA values required to keep a row/column.
● subset: an array-like of labels that limits the dropping process to the passed rows/columns.
● inplace: It is a boolean which makes the changes in data frame itself if
True.
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 5.3
Here, column C is having all NaN values. Let’s drop this column. For this
use the following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 3.0, np.nan],
                   'B': [2.0, 4.0, np.nan],
                   'C': [np.nan, np.nan, np.nan],
                   'D': [0, 1, 5]})
print(df.dropna(axis=1, how='all'))  # this code will delete the columns with all null values
Here, axis=1 means columns and how=’all’ means drop the columns with
all NaN values.
A B D
0 NaN 2.0 0
1 3.0 4.0 1
2 NaN NaN 5
Table 5.4
5.4.1.2. Drop the Columns Where Any of the Elements Is Missing
Values:Let’s consider the same dataframe again:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 5.5
Here, columns A, B and C each contain at least one NaN value. Let's drop
these columns. For this, use the following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 3.0, np.nan],
                   'B': [2.0, 4.0, np.nan],
                   'C': [np.nan, np.nan, np.nan],
                   'D': [0, 1, 5]})
print(df.dropna(axis=1, how='any'))  # this code will delete the columns with any null values
Here, axis=1 means columns and how='any' means drop the columns with one or more NaN values.
D
0 0
1 1
2 5
Table 5.6
57
5.4.1.3. Keep Only the Rows That Contain a Maximum of Two Missing
Values:
Let’s consider the same dataframe again:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 5.7
Here, row 2 has more than two NaN values, so this row will get dropped. For this, use the following code.
Code:
# importing pandas as pd
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 3.0, np.nan],
                   'B': [2.0, 4.0, np.nan],
                   'C': [np.nan, np.nan, np.nan],
                   'D': [0, 1, 5]})
print(df.dropna(thresh=2))
# this code will delete the rows with more than two null values.
Here, thresh=2 means a row is kept only if it has at least two non-NaN values, which for this four-column frame allows a maximum of two NaN values per row.
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
Table 5.8
5.4.1.4. Fill All Missing Values with the Mean, Median, Mode, Minimum, and Maximum of the Particular Numeric Column:
Another approach in this scenario is to use multiple imputation: several variations of the data set are created with different estimates of the missing values. The variations of the data sets are then used as inputs to models, and the test statistic replicates are computed for each imputed data set. From these replicate statistics, appropriate hypothesis tests can be constructed and used for decision making.
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8,
12],
[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.mean())
Output:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
                   [15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6'])
df
df.fillna(df.median())
Output:
61
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 8, 28], [55, np.nan,
8, 12],
[15, 14, np.nan, 12], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
for column in df.columns:
df[column].fillna(df[column].mode()[0], inplace=True)
df
Output:
Apple Orange Banana Pear
Basket 1 10 14 30 40
Basket 2 7 14 8 28
Basket 3 55 14 8 12
Basket 4 15 14 8 12
Basket 5 7 1 1 12
Basket 6 7.0 4 9 2
Table 5.14
Here, the mode of Apple Column = (10, 7, 55, 15, 7) = 7. So, Nan
value is replaced by 7. Similarly, in Orange Column Nan’s are replaced
with 14, in Banana’s column Nan replaced with 8 and in Pear’s column it
is replaced with 12.
Replacing Nan values with min:
Let’s take an example:
Apple Orange Banana Pear
Basket 1 10 NaN 30 40
Basket 2 7 14 21 28
Basket 3 55 NaN 8 12
Basket 4 15 14 NaN 8
Basket 5 7 1 1 NaN
Basket 6 NaN 4 9 2
Table 5.15
62
Here, we can see NaN in all the columns. Let’s fill it by their
minimum value. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8,
12],
[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.min())
Output:
63
Practical solutions to handle missing values were also covered, such as dropping the columns where all elements are missing values, dropping the columns where any of the elements is a missing value, keeping only the rows that contain a maximum of two missing values, and filling all missing values with the mean, median, mode, minimum, or maximum of the particular numeric column.
LIST OF REFERENCES
● Python for Data Science For Dummies, by Luca Massaron John Paul
Mueller (Author),
● ISBN-13 : 978-8126524938, Wiley
● Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition
● by William McKinney (Author), ISBN-13 : 978-9352136414 ,
Shroff/O'Reilly
● Data Science From Scratch: First Principles with Python, Second
Edition by Joel Grus,
● ISBN-13 : 978-9352138326, Shroff/O'Reilly
● Data Science from Scratch by Joel Grus, ISBN-13 : 978-
1491901427, O′Reilly
● Data Science Strategy For Dummies by Ulrika Jagare, ISBN-13 :
978-8126533367 , Wiley
● Pandas for Everyone: Python Data Analysis, by Daniel Y. Chen,
ISBN-13 : 978- 9352869169, Pearson Education
● Practical Data Science with R (MANNING) by Nina Zumel, John
Mount, ISBN-13 : 978- 9351194378, Dreamtech Press
*****
65
6
ASSESS SUPERSTEP
Unit Structure
6.0 Objectives
6.1 Engineering a Practical Assess Superstep
6.2 Unit End Questions
6.3 References
6.0 OBJECTIVES
This chapter will make you understand the practical concepts of:
● Assess superstep
● Python NetworkX Library used to draw network routing graphs
● Python Schedule library to schedule various jobs
NetworkX provides:
● tools for the study of the structure and dynamics of social, biological,
and infrastructure networks;
● a standard programming interface and graph implementation that is
suitable for many applications;
● a rapid development environment for collaborative, multidisciplinary
projects;
● an interface to existing numerical algorithms and code written in C,
C++, and FORTRAN; and the ability to painlessly work with large
nonstandard data sets.
With NetworkX you can load and store networks in standard and
nonstandard data formats, generate many types of random and classic
networks, analyze network structure, build network models, design new
network algorithms, draw networks, and much more.
Graph Theory:
Figure 6.1
The total number of edges attached to a node is the degree of that node.
In the graph above, M has a degree of 2 ({M,H} and {M,L}), while B has
a degree of 1 ({B,A}). Formally, the degree of a node is described as the number of edges incident to that node.
67
Connections through use of multiple edges are called paths. {F, H, M, L, H, J, G, I} is an example of a path. A simple path is a path that does not repeat a node; {I, G, J, H, F} is an example of a simple path. (A path that traverses every edge exactly once is known as an Eulerian path.) The shortest simple path between two nodes is called a geodesic. The geodesic between I and J is {I, G, J} or {I, K, J}. Finally, a cycle is a path whose start and end points are the same, for example {H, M, L, H}. (A cycle that traverses every edge exactly once is known as an Eulerian cycle.)
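A short NetworkX sketch, assuming edges consistent with the description of Figure 6.1, reproduces these measures:
import networkx as nx

G = nx.Graph()
G.add_edges_from([('M', 'H'), ('M', 'L'), ('L', 'H'), ('B', 'A'),
                  ('F', 'H'), ('H', 'J'), ('J', 'G'), ('G', 'I'),
                  ('I', 'K'), ('K', 'J')])
print(G.degree('M'))                   # degree of M is 2
print(G.degree('B'))                   # degree of B is 1
print(nx.shortest_path(G, 'I', 'J'))   # a geodesic such as ['I', 'G', 'J'] or ['I', 'K', 'J']
print(nx.cycle_basis(G))               # cycles such as H-M-L and J-G-I-K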
Figure 6.2
Neo4J’s book on graph algorithms provides a clear summary
68
Figure 6.3
For example:
# ### Creating a graph
# Create an empty graph with no nodes and no edges.
import networkx as nx
G = nx.Graph()
# The graph `G` can be grown in several ways. NetworkX includes many
graph generator
# functions # and facilities to read and write graphs in many formats.
# To get started # though we’ll look at simple manipulations. You can add
one node at a
# time,
G.add_node(1)
G.add_nodes_from([2, 3])
# Nodes can be any hashable Python object. This flexibility is very
powerful as it allows
# graphs of graphs, graphs of files, graphs of functions and much more. It
is worth thinking
# about how to structure # your application so that the nodes are useful
entities. Of course
# you can always use a unique identifier # in `G` and have a separate
dictionary keyed by
# identifier to the node information if you prefer.
# # Edges
# `G` can also be grown by adding one edge at a time,
G.add_edge(1, 2)
e = (2, 3)
G.add_edge(*e) # unpack edge tuple*
# we add new nodes/edges and NetworkX quietly ignores any that are
already present.
# At this stage the graph `G` consists of 3 nodes and 2 edges, as can be
seen by:
G.number_of_nodes()
G.number_of_edges()
list(G.nodes)
list(G.edges)
list(G.adj[1]) # or list(G.neighbors(1))
G.degree[1] # the number of edges incident to 1
# One can specify to report the edges and degree from a subset of all
nodes using an
#nbunch.
# An *nbunch* is any of: `None` (meaning all nodes), a node, or an
iterable container of nodes that is # not itself a node in the graph.
G.edges([2, 'm'])
G.degree([2, 3])
G.remove_node(2)
G.remove_nodes_from("spam")
list(G.nodes)
G.add_edge(1, 3)  # ensure the edge exists before removing it
G.remove_edge(1, 3)
G.add_edge(1, 2)
H = nx.DiGraph(G) # create a DiGraph using the connections from G
list(H.edges())
edgelist = [(0, 1), (1, 2), (2, 3)]
H = nx.Graph(edgelist)
# As an example, `n1` and `n2` could be protein objects from the RCSB
Protein Data Bank,
#and `x` # could refer to an XML record of publications detailing
experimental observations
#of their interaction.
# We have found this power quite useful, but its abuse can lead to
surprising behavior
#unless one is # familiar with Python.
# If in doubt, consider using `convert_node_labels_to_integers()` to obtain
a more
traditional graph with # integer labels. Accessing edges and neighbors
# In addition to the views `Graph.edges`, and `Graph.adj`, access to edges
and neighbors is
#possible using subscript notation.
G.add_edge(1, 3)
G[1][3]['color'] = "blue"
G.edges[1, 2]['color'] = "red"
G.edges[1, 2]
FG = nx.Graph()
FG.add_weighted_edges_from([(1, 2, 0.125), (1, 3, 0.75), (2, 4, 1.2), (3, 4,
0.375)])
for n, nbrs in FG.adj.items():
    for nbr, eattr in nbrs.items():
        wt = eattr['weight']
        if wt < 0.5:
            print(f"({n}, {nbr}, {wt:.3})")
# Convenient access to all edges is achieved with the edges property
G = nx.Graph(day="Friday")
G.graph
G.graph['day'] = "Monday"
G.graph
# # Node attributes
# Add node attributes using `add_node()`, `add_nodes_from()`, or
`G.nodes`
G.add_node(1, time='5pm')
G.add_nodes_from([3], time='2pm')
G.nodes[1]
G.nodes[1]['room'] = 714
G.nodes.data()
# Note that adding a node to `G.nodes` does not add it to the graph, use
# `G.add_node()` to add new nodes. Similarly for edges.
# # Edge Attributes
# Add/change edge attributes using `add_edge()`, `add_edges_from()`,
# or subscript notation.
G.add_edge(1, 2, weight=4.7 )
G.add_edges_from([(3, 4), (4, 5)], color='red')
G.add_edges_from([(1, 2, {'color': 'blue'}), (2, 3, {'weight': 8})])
G[1][2]['weight'] = 4.7
G.edges[3, 4]['weight'] = 4.2
DG = nx.DiGraph()
DG.add_weighted_edges_from([(1, 2, 0.5), (3, 1, 0.75)])
DG.out_degree(1, weight='weight')
DG.degree(1, weight='weight')
list(DG.successors(1))
list(DG.neighbors(1))
# Some algorithms work only for directed graphs and others are not well
# defined for directed graphs. Indeed the tendency to lump directed
# and undirected graphs together is dangerous. If you want to treat
# a directed graph as undirected for some measurement you should
probably
# convert it using `Graph.to_undirected()` or with
H = nx.Graph(G) # create an undirected graph H from a directed graph G
# # Multigraphs
# NetworkX provides classes for graphs which allow multiple edges
# between any pair of nodes. The `MultiGraph` and
# `MultiDiGraph`
# classes allow you to add the same edge twice, possibly with different
# edge data. This can be powerful for some applications, but many
# algorithms are not well defined on such graphs.
# Where results are well defined,
# e.g., `MultiGraph.degree()` we provide the function. Otherwise you
# should convert to a standard graph in a way that makes the measurement
well defined
MG = nx.MultiGraph()
MG.add_weighted_edges_from([(1, 2, 0.5), (1, 2, 0.75), (2, 3, 0.5)])
dict(MG.degree(weight='weight'))
GG = nx.Graph()
for n, nbrs in MG.adjacency():
    for nbr, edict in nbrs.items():
        minvalue = min([d['weight'] for d in edict.values()])
        GG.add_edge(n, nbr, weight=minvalue)
nx.shortest_path(GG, 1, 3)
K_5 = nx.complete_graph(5)
K_3_5 = nx.complete_bipartite_graph(3, 5)
barbell = nx.barbell_graph(10, 10)
lollipop = nx.lollipop_graph(10, 20)
er = nx.erdos_renyi_graph(100, 0.15)
ws = nx.watts_strogatz_graph(30, 3, 0.1)
ba = nx.barabasi_albert_graph(100, 5)
red = nx.random_lobster(100, 0.9, 0.9)
nx.write_gml(red, "path.to.file")
mygraph = nx.read_gml("path.to.file")
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3)])
G.add_node("spam") # adds node "spam"
list(nx.connected_components(G))
sorted(d for n, d in G.degree())
nx.clustering(G)
# Some functions with large output iterate over (node, value) 2-tuples.
# These are easily stored in a
[dict](https://docs.python.org/3/library/stdtypes.html#dict)
# structure if you desire.
sp = dict(nx.all_pairs_shortest_path(G))
sp[3]
import matplotlib.pyplot as plt
G = nx.petersen_graph()
plt.subplot(121)
nx.draw(G, with_labels=True, font_weight='bold')
plt.subplot(122)
nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True,
font_weight='bold')
# Note that you may need to issue a Matplotlib plt.show() command
# when drawing to an interactive display.
plt.show()
options = {
    'node_color': 'black',
    'node_size': 100,
    'width': 3,
}
plt.subplot(221)
nx.draw_random(G, **options)
plt.subplot(222)
nx.draw_circular(G, **options)
plt.subplot(223)
nx.draw_spectral(G, **options)
plt.subplot(224)
nx.draw_shell(G, nlist=[range(5,10), range(5)], **options)
77
Figure 6.4
plt.show()
options = {
    'node_color': 'black',
    'node_size': 100,
    'width': 3,
}
plt.subplot(221)
nx.draw_random(G, **options)
plt.subplot(222)
nx.draw_circular(G, **options)
plt.subplot(223)
nx.draw_spectral(G, **options)
plt.subplot(224)
nx.draw_shell(G, nlist=[range(5,10), range(5)], **options)
Figure 6.5
78
G = nx.dodecahedral_graph()
shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12,
13]]
nx.draw_shell(G, nlist=shells, **options)
Figure 6.6
nx.draw(G)
plt.savefig("path.png")
Installation:
$ pip install schedule
schedule.Scheduler class:
● schedule.every(interval=1) : Calls every on the default scheduler
instance. Schedule a new periodic job.
● schedule.run_pending() : Calls run pending on the default scheduler
instance. Run all jobs that are scheduled to run.
● schedule.run_all(delay_seconds=0) : Calls run_all on the default
scheduler instance. Run all jobs regardless if they are scheduled to
run or not.
● schedule.idle_seconds() : Calls idle_seconds on the default scheduler
instance.
● schedule.next_run() : Calls next_run on the default scheduler
instance. Datetime when
the next job should run.
● schedule.cancel_job(job) : Calls cancel_job on the default scheduler
instance. Delete a scheduled job.
● schedule.Job(interval, scheduler=None) class
A periodic job as used by Scheduler.
Parameters:
● interval: A quantity of a certain time unit
● scheduler: The Scheduler instance that this job will register itself
with once it has been fully configured in Job.do().
● do(job_func, *args, **kwargs) : Specifies the job_func that should
be called every time the job runs. Any additional arguments are
passed on to job_func when the job runs.
Parameters: job_func – The function to be scheduled. Returns: The
invoked job instance
● run() : Run the job and immediately reschedule it. Returns: The
return value returned by the job_func
● to(latest) : Schedule the job to run at an irregular (randomized)
interval. For example, every(A).to(B).seconds executes the job
function every N seconds such that A <= N <= B.
For example
# Schedule Library imported
import schedule
import time
# Functions setup
def placement():
    print("Get ready for Placement at various companies")

def good_luck():
    print("Good Luck for Test")

def work():
    print("Study and work hard")

def bedtime():
    print("It is bed time go rest")

def datascience():
    print("Data science with python is fun")
# Task scheduling
# After every 10 minutes datascience() is called.
schedule.every(10).minutes.do(datascience)
# Loop so that the scheduling task
# keeps on running all the time.
while True:
    schedule.run_pending()
    time.sleep(1)
1. Write Python program to create the network routing diagram from the
given data.
2. Write a Python program to build directed acyclic graph.
3. Write a Python program to pick the content for Bill Boards from the
given data.
4. Write a Python program to generate visitors data from the given csv
file.
REFERENCES
● Python for Data Science For Dummies, by Luca Massaron John Paul
Mueller (Author),
● ISBN-13 : 978-8126524938, Wiley
● Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition by William McKinney (Author), ISBN-13 :
978-9352136414 , Shroff/O'Reilly
● Data Science From Scratch: First Principles with Python, Second
Edition by Joel Grus, ISBN-13 : 978-9352138326, Shroff/O'Reilly
● Data Science from Scratch by Joel Grus, ISBN-13 : 978-1491901427
,O′Reilly
● Data Science Strategy For Dummies by Ulrika Jagare, ISBN-13 :
978-8126533367 , Wiley
● Pandas for Everyone: Python Data Analysis, by Daniel Y. Chen,
ISBN-13 : 978- 9352869169, Pearson Education
● Practical Data Science with R (MANNING) by Nina Zumel, John
Mount, ISBN-13 : 978- 9351194378, Dreamtech Press
*****
82
Unit IV
7
PROCESS SUPERSTEP
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Data Vault
7.2.1 Hubs
7.2.2 Links
7.2.3 Satellites
7.2.4 Reference Satellites
7.3 Time-Person-Object-Location-Event Data Vault
7.4 Time Section
7.4.1 Time Hub
7.4.2 Time Links
7.4.3 Time Satellites
7.5 Person Section
7.5.1 Person Hub
7.5.2 Person Links
7.5.3 Person Satellites
7.6 Object Section
7.6.1 Object Hub
7.6.2 Object Links
7.6.3 Object Satellites
7.7 Location Section
7.7.1 Location Hub
7.7.2 Location Links
7.7.3 Location Satellites
7.8 Event Section
7.8.1 Event Hub
7.8.2 Event Links
7.8.3 Event Satellites
7.9 Engineering a Practical Process Superstep
7.9.1 Event
7.9.2 Explicit Event
7.9.3 Implicit Event
7.10 5-Whys Technique
7.10.1 Benefits of the 5 Whys
7.10.2 When Are the 5 Whys Most Useful?
7.10.3 How to Complete the 5 Whys
7.11 Fishbone Diagrams
7.12 Monte Carlo Simulation
7.13 Causal Loop Diagrams
7.14 Pareto Chart
7.15 Correlation Analysis
7.16 Forecasting
7.17 Data Science
7.0 OBJECTIVES
7.2 INTRODUCTION
84
7.2.1 Hubs:
Data vault hub is used to store business key. These keys do not
change over time. Hub also contains a surrogate key for each hub entry
and metadata information for a business key.
7.2.2 Links:
7.2.3 Satellites:
7.4.1 Time Hub:
Following are the time links that can be stored as separate links.
● Time-Person Link
• This link connects date-time values from time hub to person hub.
• Dates such as birthdays, anniversaries, book access date, etc.
● Time-Object Link
• This link connects date-time values from time hub to object hub.
• Dates such as when you buy or sell car, house or book, etc.
● Time-Location Link
• This link connects date-time values from time hub to location hub.
• Dates such as when you moved or access book from post code, etc.
● Time-Event Link
• This link connects date-time values from time hub to event hub.
• Dates such as when you changed vehicles, etc.
The time satellite can be used to move from one time zone to another
very easily. This feature will be used during the Transform superstep.
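A minimal sketch of such a conversion with pytz (the target zone is only an example):
from datetime import datetime
from pytz import timezone

# Store the satellite value in UTC and convert it to a local zone on demand
utc_time = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone('UTC'))
local_time = utc_time.astimezone(timezone('Asia/Kolkata'))
print(utc_time.strftime('%Y-%m-%d %H:%M:%S (%Z)'))
print(local_time.strftime('%Y-%m-%d %H:%M:%S (%Z)'))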
Person section contains data structure to store all data related to person.
87
7.5.2 Person Links:
Person Links connect person hub to other hubs.
88
7.6 OBJECT SECTION
Object section contains data structure to store all data related to object.
Following are the object links that can be stored as separate links.
● Object-Time Link
• This link contains relationship between Object hub and time hub.
● Object-Person Link
• This link contains relationship between Object hub and Person
hub.
● Object-Location Link
• This link contains relationship between Object hub and Location
hub.
● Object-Event Link
• This link contains relationship between Object hub and event hub.
90
Figure 7.6 Location Link
Following are the location links that can be stored as separate links.
● Location-Time Link
• This link contains relationship between location hub and time hub.
● Location-Person Link
• This link contains relationship between location hub and person
hub.
● Location-Object Link
• This link contains relationship between location hub and object
hub.
● Location-Event Link
• This link contains relationship between location hub and event
hub.
7.7.3 Location Satellites:
Location satellites are part of vault that contains locations of entities.
91
7.8 EVENT SECTION
The event hub contains various fields that store real-world events.
Following are the event links that can be stored as separate links.
● Event-Time Link
• This link contains relationship between event hub and time hub.
● Event-Person Link
• This link contains relationship between event hub and person hub.
● Event-Object Link
• This link contains relationship between event hub and object hub.
● Event-Location Link
• This link contains relationship between event hub and location
hub.
Time:
Year
The standard uses four digits to represent year. The values ranges from
0000 to 9999.
AD/BC requires conversion
Year Conversion
N AD Year N
3 AD Year 3
1 AD Year 1
1 BC Year 0
2 BC Year – 1
2020AD +2020
2020BC -2019 (year -1 for BC)
Table 7.1
from datetime import datetime
from pytz import timezone, all_timezones
now_date = datetime(2020, 1, 2, 3, 4, 5, 6)
now_utc = now_date.replace(tzinfo=timezone('UTC'))
print('Date:', str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Year:', str(now_utc.strftime("%Y")))
Output:
Month:
The standard uses two digits to represent month. The values ranges from
01 to 12.
Output:
Number Name
01 January
02 February
03 March
04 April
05 May
06 June
07 July
08 August
09 September
10 October
11 November
12 December
Table 7.2
Day:
The standard uses two digits to represent the day. The values range from
01 to 31.
The rule for a valid date is that 22 January 2020 becomes 2020-01-22 or
+2020-01-22.
Output:
Hour:
The standard uses two digits to represent hour. The values ranges from 00
to 24.
The use of 00:00:00 is the beginning of the calendar day. The use of
24:00:00 is only to indicate the end of the calendar day.
print('Date:', str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Hour:', str(now_utc.strftime("%H")))
Output:
Minute:
The standard uses two digits to represent minute. The values ranges from
00 to 59.
The standard minute must use two-digit values within the range of 00
through 59.
Output:
95
Second:
The standard uses two digits to represent second. The values ranges from
00 to 59.
Output:
Output:
7.9.1 Event:
This type of event is stated in the data source clearly and with full
details. There is clear data to show that the specific action was performed.
Explicit events are the events that the source systems supply, as
these have direct data that proves that the specific action was performed.
These three events would imply that Mr. Vermeulen entered room
302 as an event. Not true!
97
7.10.1 Benefits of the 5 Whys:
The 5 Whys assist the data scientist to identify the root cause of a
problem and determine the relationship between different root causes of
the same problem. It is one of the simplest investigative tools—easy to
complete without intense statistical analysis.
The 5 Whys are most useful for finding solutions to problems that
involve human factors or interactions that generate multi-layered data
problems. In day-to-day business life, they can be used in real-world
businesses to find the root causes of issues.
Write down the specific problem. This will help you to formalize
the problem and describe it completely. It also helps the data science team
to focus on the same problem. Ask why the problem occurred and write
the answer below the problem. If the answer you provided doesn’t identify
the root cause of the problem that you wrote down first, ask why again,
and write down that answer. Loop back to the preceding step until you and
your customer are in agreement that the problem’s root cause is identified.
Again, this may require fewer or more than the 5 Whys.
Example:
99
7.12 MONTE CARLO SIMULATION
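A Monte Carlo simulation repeatedly samples random inputs to estimate a quantity that is hard to compute directly. A minimal sketch, estimating the value of pi from random points in the unit square:
import random

trials = 100000
# Count how many random points fall inside the quarter circle of radius 1
inside = sum(1 for _ in range(trials)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
print('Estimated pi:', 4 * inside / trials)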
7.14 PARETO CHART
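A Pareto chart orders categories by frequency and overlays the cumulative percentage, highlighting the vital few causes behind most of the effect. A minimal sketch with hypothetical defect counts:
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series({'Late delivery': 45, 'Wrong item': 30, 'Damaged': 15,
                    'Billing error': 7, 'Other': 3}).sort_values(ascending=False)
cum_pct = counts.cumsum() / counts.sum() * 100

fig, ax1 = plt.subplots()
counts.plot(kind='bar', ax=ax1, color='steelblue')   # bar chart of counts
ax1.set_ylabel('Count')
ax2 = ax1.twinx()
cum_pct.plot(ax=ax2, color='crimson', marker='o')    # cumulative percentage line
ax2.set_ylabel('Cumulative %')
ax2.set_ylim(0, 110)
plt.title('Pareto chart of defect causes')
plt.show()
7.15 CORRELATION ANALYSIS
The code below computes the pairwise correlation between the columns of a small data frame.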
import pandas as pd
a = [ [1, 2, 4], [5, 4.1, 9], [8, 3, 13], [4, 3, 19], [5, 6, 12], [5, 6, 11],[5, 6,
4.1], [4, 3, 6]]
df = pd.DataFrame(data=a)
cr=df.corr()
print(cr)
7.16 FORECASTING
102
SUMMARY
The Process superstep converts the assessed results of the Retrieve
superstep into highly structured data vaults that act as the basic
data structure for the remaining data science steps.
14. Explain the significance of Monte Carlo Simulation and Causal Loop
Diagram.
15. What are pareto charts? What information can be obtained from pareto
charts?
16. Explain the use of correlation and forecasting in data science.
17. State and explain the five steps of data science.
REFERENCES
https://asq.org/
https://scikit-learn.org/
https://www.geeksforgeeks.org/
https://statistics.laerd.com/spss-tutorials/
https://www.kdnuggets.com/
*****
103
8
TRANSFORM SUPERSTEP
Unit Structure
8.0 Objectives
8.1 Introduction
8.2 Dimension Consolidation
8.3 Sun Model
8.3.1 Person-to-Time Sun Model
8.3.2 Person-to-Object Sun Model
8.3.3 Person-to-Location Sun Model
8.3.4 Person-to-Event Sun Model
8.3.5 Sun Model to Transform Step
8.4 Transforming with Data Science
8.5 Common Feature Extraction Techniques
8.5.1 Binning
8.5.2 Averaging
8.6 Hypothesis Testing
8.6.1 T-Test
8.6.2 Chi-Square Test
8.7 Overfitting & Underfitting
8.7.1 Polynomial Features
8.7.2 Common Data-Fitting Issue
8.8 Precision-Recall
8.8.1 Precision-Recall Curve
8.8.2 Sensitivity & Specificity
8.8.3 F1-Measure
8.8.4 Receiver Operating Characteristic (ROC) Analysis Curves
8.9 Cross-Validation Test
8.10 Univariate Analysis
8.11 Bivariate Analysis
8.12 Multivariate Analysis
8.13 Linear Regression
8.13.1 Simple Linear Regression
8.13.2 RANSAC Linear Regression
8.13.3 Hough Transform
8.14 Logistic Regression
8.14.1 Simple Logistic Regression
8.14.2 Multinomial Logistic Regression
8.14.3 Ordinal Logistic Regression
8.15 Clustering Techniques
8.15.1 Hierarchical Clustering
8.15.2 Partitional Clustering
8.16 ANOVA
8.17 Decision Trees
8.0 OBJECTIVES
The objective of this chapter is to learn data transformation
techniques, feature extraction techniques, missing data handling, and
various techniques to categorise data into suitable groups.
8.1 INTRODUCTION
The Transform Superstep allows us to take data from the data vault and
answer the questions raised by the investigation.
The Transform Superstep uses the data vault from the process step as
its source data.
106
8.3.2 Person-to-Object Sun Model:
Figure 8.6 Sun model for PersonBorn fact
You must build three items: dimension Person, dimension Time, and
fact PersonBornAtTime. Open your Python editor and create a file named
Transform-Gunnarsson-Sun-Model.py.
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
############################################################
####
if sys.platform == 'linux' or sys.platform == 'darwin':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
############################################################
####
Company='01-Vermeulen'
############################################################
####
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
############################################################
####
sDataWarehousetDir=Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName=sDataWarehousetDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)
print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")
############################################################
####
IDTimeNumber=str(uuid.uuid4())
TimeLine = [('TimeID', [IDTimeNumber]),
            ('UTCDate', [BirthDateZoneStr]),
            ('LocalTime', [BirthDateLocal]),
            ('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame(dict(TimeLine))  # DataFrame.from_items was removed in newer pandas versions
############################################################
####
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
############################################################
###
IDPersonNumber=str(uuid.uuid4())
PersonLine = [('PersonID', [IDPersonNumber]),
              ('FirstName', [FirstName]),
              ('LastName', [LastName]),
              ('Zone', ['UTC']),
              ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame(dict(PersonLine))  # replaces the removed DataFrame.from_items
############################################################
####
DimPerson=PersonFrame
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
############################################################
####
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
print('\n#################################')
print('Fact - Person - time')
print('\n#################################')
IDFactNumber=str(uuid.uuid4())
PersonTimeLine = [('IDNumber', [IDFactNumber]),
                  ('IDPersonNumber', [IDPersonNumber]),
                  ('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame(dict(PersonTimeLine))  # replaces the removed DataFrame.from_items
############################################################
####
FctPersonTime=PersonTimeFrame
FctPersonTimeIndex = FctPersonTime.set_index(['IDNumber'], inplace=False)
############################################################
####
sTable = 'Fact-Person-Time'
print('\n#################################')
print('Storing:',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")
Missing data in the training data set can reduce the power / fit of a
model or can lead to a biased model because we have not analyzed the
behavior and relationship with other variables correctly. It can lead to
wrong prediction or classification.
111
8.5.1 Binning:
Binning technique is used to reduce the complexity of data sets, to
enable the data scientist to evaluate the data with an organized grouping
technique.
Binning is a good way for you to turn continuous data into a data
set that has specific features that you can evaluate for patterns. For
example, if you have data about a group of people, you might want to
arrange their ages into a smaller number of age intervals (for example,
grouping every five years together).
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)
#The second is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
print(bin_means2)
8.5.2 Averaging:
Example:
Create a model that enables you to calculate the average position for ten
sample points. First, set up the ecosystem.
import numpy as np
import pandas as pd
#Create two series to model the latitude and longitude ranges.
LatitudeData = pd.Series(np.array(range(-90, 91, 1)))
LongitudeData = pd.Series(np.array(range(-180, 181, 1)))  # longitude spans -180 to 180 degrees
#Select 10 samples for each range:
LatitudeSet=LatitudeData.sample(10)
LongitudeSet=LongitudeData.sample(10)
#Calculate the average of each data set
LatitudeAverage = np.average(LatitudeSet)
LongitudeAverage = np.average(LongitudeSet)
#See the results
print('Latitude')
print(LatitudeSet)
print('Latitude (Avg):',LatitudeAverage)
print('##############')
print('Longitude')
print(LongitudeSet)
print('Longitude (Avg):', LongitudeAverage)
8.6.1 T-Test:
The t-test is one of many tests used for the purpose of hypothesis
testing in statistics. A t-test is a popular statistical test to make inferences
about single means or inferences about two means or variances, to check if
the two groups’ means are statistically different from each other, where
n(sample size) < 30 and standard deviation is unknown.
import numpy as np
from scipy.stats import ttest_1samp
# illustrative sample ages (the original data for this example is not shown here)
ages = np.array([32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22])
tset, pval = ttest_1samp(ages, 30)
print('p-values - ', pval)
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")
A chi-square (or squared [χ2]) test is used to check if two variables are
significantly different from each other. These variables are categorical.
import numpy as np
import pandas as pd
import scipy.stats as stats
np.random.seed(10)
stud_grade = np.random.choice(a=["O","A","B","C","D"],
p=[0.20, 0.20 ,0.20, 0.20, 0.20], size=100)
stud_gen = np.random.choice(a=["Male","Female"], p=[0.5, 0.5],
size=100)
mscpart1 = pd.DataFrame({"Grades":stud_grade, "Gender":stud_gen})
print(mscpart1)
stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)
stud_tab.columns = ["Male", "Female", "row_totals"]
stud_tab.index = ["O", "A", "B", "C", "D", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2 ]
print(observed)
expected = np.outer(stud_tab["row_totals"][0:5],
stud_tab.loc["col_totals"][0:2]) / 100
print(expected)
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print('Calculated : ',chi_squared_stat)
crit = stats.chi2.ppf(q=0.95, df=4)
print('Table Value : ',crit)
if chi_squared_stat >= crit:
    print('H0 is Rejected')
else:
    print('H0 is Accepted')
114
8.7 OVERFITTING & UNDERFITTING
Overfitting occurs when the model or the algorithm fits the data
too well. When a model is trained on too much data, it starts
learning from the noise and the inaccurate entries in the data set. The
problem is that the model is then unable to categorize new data
correctly, because it has captured too much detail and noise.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
    """ function to approximate by polynomial interpolation"""
    return x * np.sin(x)
# generate points used to plot
x_plot = np.linspace(0, 10, 100)
# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
# create matrix versions of these arrays
X = x[:,np.newaxis]
X_plot = x_plot[:,np.newaxis]
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
label="Ground Truth")
plt.scatter(x, y, color='navy', s=30, marker='o', label="training points")
for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
             label="Degree %d" % degree)
plt.legend(loc='lower left')
plt.show()
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
8.8 PRECISION-RECALL
In this context, high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
A system with high recalls but low precision returns many results,
but most of its predicted labels are incorrect when compared to the
training labels. A system with high precision but low recall is just the
opposite, returning very few results, but most of its predicted labels are
correct when compared to the training labels. An ideal system with high
precision and high recall will return many results, with all results labelled
correctly.
Precision (P) is defined as the number of true positives (Tp) over the number of true positives (Tp) plus the number of false positives (Fp): P = Tp / (Tp + Fp).
Recall (R) is defined as the number of true positives (Tp) over the number of true positives (Tp) plus the number of false negatives (Fn): R = Tp / (Tp + Fn).
The true negative rate (TNR) is the rate that indicates the recall of the negative items: TNR = Tn / (Tn + Fp).
118
8.8.3 F1-Measure:
The F1-score is a measure that combines precision and recall in the
harmonic mean of precision and recall.
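A minimal scikit-learn sketch of these measures on illustrative labels (the values are not from the text):
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative classifier output
print('Precision:', precision_score(y_true, y_pred))
print('Recall:', recall_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))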
You will find the ROC analysis curves useful for evaluating
whether your classification or feature engineering is good enough to
determine the value of the insights you are finding. This helps with
repeatable results against a real-world data set. So, if you suggest that your
customers should take as pecific action as a result of your findings, ROC
analysis curves will support your advice and insights but also relay the
quality of the insights at given parameters.
Example:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
Let's pick three different kernels and compare how they will perform. One possible setup for the range of C values (C_s) and the cross-validation scores used below is the following:
C_s = np.logspace(-10, 0, 10)
for kernel in ['linear', 'poly', 'rbf']:
    svc = svm.SVC(kernel=kernel)
    scores = []
    scores_std = []
    for C in C_s:
        svc.C = C
        this_scores = cross_val_score(svc, X, y, cv=5)
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))
    Title = "Kernel:>" + kernel
    fig = plt.figure(1, figsize=(4.2, 6))
    plt.clf()
    fig.suptitle(Title, fontsize=20)
    plt.semilogx(C_s, scores)
    plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
    plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
    locs, labels = plt.yticks()
    plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
    plt.ylabel('Cross-Validation Score')
    plt.xlabel('Parameter C')
    plt.ylim(0, 1.1)
    plt.show()
Table 8.1
Suppose that the heights of seven students of a class is recorded (in
the above figure), there is only one variable that is height and it is not
dealing with any cause or relationship. The description of patterns found
in this type of data can be made by drawing conclusions using central
tendency measures (mean, median and mode), dispersion or spread of data
(range, minimum, maximum, quartiles, variance and standard deviation)
and by using frequency distribution tables, histograms, pie charts,
frequency polygon and bar charts.
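A minimal pandas sketch of a univariate summary for such a height variable (the values are illustrative):
import pandas as pd

heights = pd.Series([1.52, 1.60, 1.65, 1.65, 1.70, 1.73, 1.81])
print(heights.describe())                 # count, mean, std, min, quartiles, max
print('Mode:', heights.mode().tolist())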
Table 8.2
Suppose the temperature and ice cream sales are the two variables
of a bivariate data (in the above figure). Here, the relationship is visible
from the table that temperature and sales are directly proportional to each
other and thus related because as the temperature increases, the sales also
increase. Thus, bivariate data analysis involves comparisons, relationships,
causes and explanations. These variables are often plotted on X and Y axis
on the graph for better understanding of data and one of these variables is
independent while the other is dependent.
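A minimal pandas sketch of a bivariate relationship with illustrative temperature and sales figures:
import pandas as pd

df = pd.DataFrame({'temperature': [20, 22, 25, 27, 30, 33, 35],
                   'sales': [120, 135, 160, 180, 210, 240, 260]})
print(df.corr())   # a strong positive correlation between the two variables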
121
8.12 MULTIVARIATE ANALYSIS
122
8.13.1 Simple Linear Regression:
Figure 8.8
8.13.2 RANSAC Linear Regression:
124
For example, a training set could include a customer's age, income, and gender, as well as the age of an existing automobile. The
training set would also include the outcome variable on whether the
person purchased a new automobile over a 12-month period. The logistic
regression model provides the likelihood or probability of a person making
a purchase in the next 12 months.
• Finance: Using a loan applicant's credit history and the details on the
loan, determine the probability that an applicant will default on the
loan. Based on the prediction, the loan can be approved or denied, or
the terms can be modified.
Simple logistic regression can be used when you have one nominal
variable with two values (male/female, dead/alive, etc.) and one
measurement variable. The nominal variable is the dependent variable,
and the measurement variable is the independent variable. Logistic
Regression, also known as Logit Regression or Logit Model. Logistic
Regression works with binary data, where either the event happens (1) or
the event does not happen (0).
125
Simple logistic regression is analogous to linear regression, except
that the dependent variable is nominal, not a measurement. One goal is to
see whether the probability of getting a particular value of the nominal
variable is associated with the measurement variable; the other goal is to
predict the probability of getting a particular value of the nominal variable,
given the measurement variable.
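A minimal scikit-learn sketch of simple logistic regression on a built-in binary data set (chosen only for illustration):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print('Accuracy:', model.score(X_test, y_test))
# Probability that the first test observation belongs to class 1
print('P(class 1):', model.predict_proba(X_test[:1])[0, 1])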
8.14.3 Ordinal Logistic Regression:
For example, you could use ordinal regression to predict the belief
that "tax is too high" (your ordinal dependent variable, measured on a 4-
point Likert item from "Strongly Disagree" to "Strongly Agree"), based on
two independent variables: "age" and "income". Alternately, you could use
ordinal regression to determine whether a number of independent
variables, such as "age", "gender", "level of physical activity" (amongst
others), predict the ordinal dependent variable, "obesity", where obesity is
measured using three ordered categories: "normal", "overweight" and
"obese".
For example: All files and folders on our hard disk are organized in a
hierarchy.
8.16 ANOVA
The ANOVA test is the initial step in analysing factors that affect a
given data set. Once the test is finished, an analyst performs additional
testing on the methodical factors that measurably contribute to the data
set's inconsistency. The analyst utilizes the ANOVA test results in an f-
test to generate additional data that aligns with the proposed regression
models. The ANOVA test allows a comparison of more than two groups at
the same time to determine whether a relationship exists between them.
Example:
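A minimal one-way ANOVA sketch with SciPy on illustrative group measurements (the values are not from the text):
from scipy import stats

group_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
# The F statistic and p-value test whether the three group means differ
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print('F =', f_stat, 'p =', p_value)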
Example:
Internal nodes are the decision or test points. Each internal node
refers to an input variable or an attribute. The top internal node is called
the root. The decision tree in the above figure is a binary tree in that each
internal node has no more than two branches. The branching of a node is
referred to as a split.
The decision tree in the above figure shows that females with
income less than or equal to $45,000 and males 40 years old or younger
are classified as people who would purchase the product. In traversing this
tree, age does not matter for females, and income does not matter for
males.
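A minimal scikit-learn sketch that fits a decision tree to a small, hypothetical purchase data set (loosely following the example above) and prints the learned rules:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({'female': [1, 1, 1, 0, 0, 0, 0, 1],
                     'income': [30000, 40000, 60000, 50000, 80000, 45000, 90000, 70000],
                     'age': [25, 33, 41, 35, 52, 28, 60, 45],
                     'purchase': [1, 1, 0, 1, 0, 1, 0, 0]})
X = data[['female', 'income', 'age']]
y = data['purchase']
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))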
Where decision tree is used?
• Decision trees are widely used in practice.
• To classify animals, questions (like cold-blooded or warm-blooded,
mammal or not mammal) are answered to arrive at a certain
classification.
• A checklist of symptoms during a doctor's evaluation of a patient.
• The artificial intelligence engine of a video game commonly uses
decision trees to control the autonomous actions of a character in
response to various scenarios.
• Retailers can use decision trees to segment customers or predict
response rates to marketing and promotions.
• Financial institutions can use decision trees to help decide if a loan
application should be approved or denied. In the case of loan approval,
computers can use the logical if - then statements to predict whether
the customer will default on the loan.
SUMMARY
The Transform superstep allows us to take data from the data vault
and formulate answers to questions raised by the investigations. The
transformation step is the data science process that converts results into
insights.
133
11. Explain precision recall, precision recall curve, sensitivity,
specificity and F1 measure.
12. Explain Univariate Analysis.
13. Explain Bivariate Analysis.
14. What is Linear Regression? Give some common application of linear
regression in the real world.
15. What is Simple Linear Regression? Explain.
16. Write a note on RANSAC Linear Regression.
17. Write a note on Logistic Regression.
18. Write a note on Simple Logistic Regression.
19. Write a note on Multinomial Logistic Regression.
20. Write a note on Ordinal Logistic Regression.
21. Explain Clustering techniques.
22. Explain Receiver Operating Characteristic (ROC) Analysis Curves
and cross validation test.
23. Write a note on ANOVA.
24. Write a note on Decision Trees.
REFERENCES
https://asq.org/
https://scikit-learn.org/
https://www.geeksforgeeks.org/
https://statistics.laerd.com/spss-tutorials/
https://www.kdnuggets.com/
*****
134
9
TRANSFORM SUPERSTEP
Unit Structure
9.0 Objectives
9.1 Introduction
9.2 Overview
9.3 Dimension Consolidation
9.4 The SUN Model
9.5 Transforming with data science
9.5.1 Missing value treatment
9.5.2 Techniques of outlier detection and Treatment
9.6 Hypothesis testing
9.7 Chi-square test
9.8 Univariate Analysis.
9.9 Bivariate Analysis
9.10 Multivariate Analysis
9.11 Linear Regression
9.12 Logistic Regression
9.13 Clustering Techniques
9.14 ANOVA
9.15 Principal Component Analysis (PCA)
9.16 Decision Trees
9.17 Support Vector Machines
9.18 Networks, Clusters, and Grids
9.19 Data Mining
9.20 Pattern Recognition
9.21 Machine Learning
9.22 Bagging Data
9.23 Random Forests
9.24 Computer Vision (CV)
9.25 Natural Language Processing (NLP)
9.26 Neural Networks
9.27 TensorFlow
9.0 OBJECTIVES
9.2 OVERVIEW
To explain this, the scenario below is shown.
Data is categorised into 5 different dimensions:
1. Time
2. Person
3. Object
4. Location
5. Event
Figure 9.1
136
9.4 THE SUN MODEL
The use of sun models is a technique that enables the data scientist
to perform consistent dimension consolidation, by explaining the intended
data relationship with the business, without exposing it to the technical
details required to complete the transformation processing.
You must describe in detail what the missing value treatments are
for the data lake transformation. Make sure you take your business
community with you along the journey. At the end of the process, they
must trust your techniques and results. If they trust the process, they will
implement the business decisions that you, as a data scientist, aspire to
achieve.
The 5 Whys is the technique that helps you to get to the root cause of your
analysis. The use of cause-and-effect fishbone diagrams will assist you to
resolve those questions. I have found the following common reasons for
missing data:
• Data fields renamed during upgrades
• Migration processes from old systems to new systems where mappings were incomplete
• Incorrect tables supplied in loading specifications by the subject-matter expert
• Data simply not recorded, as it was not available
• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry
137
• Someone else's "bad" data science. People and projects make mistakes, and you will have to fix their errors in your own data science.
There are many tests for hypothesis testing, but the following two
are most popular.
There are two types of chi-square tests. Both use the chi-square
statistic and distribution for different purposes:
138
bearing (control) mice. Statistical procedures used for this analysis include
a t-test, ANOVA, Mann–Whitney U test, Wilcoxon signed-rank test, and
logistic regression. These tests are used to individually or globally screen
the measured metabolites for an association with a disease.
139
outcome variables to achieve a linear relationship between the modified
input and outcome variables.
Linear regression is a useful tool for answering the first question, and
logistic regression is a popular method for addressing the second.
9.13 CLUSTERING TECHNIQUES

Clustering algorithms/methods:
There are several clustering algorithms/methods available, of which a few are explained below:
● Connectivity Clustering Method: This model is based on the
connectivity between the data points. These models are based on the
notion that the data points closer in data space exhibit more similarity
to each other than the data points lying farther away.
● Clustering Partition Method: This works by dividing the data set into a predefined number of non-empty partitions. It is suitable for small datasets.
● Centroid Cluster Method: This model revolves around a centre element of the dataset. The data points closest to the centre point (the centroid) are considered to form a cluster. The K-Means clustering algorithm is the best-known example of such a model (see the sketch after this list).
● Hierarchical Clustering Method: This method builds a tree-based structure of nested clusters, formed from divisions and their sub-divisions in a hierarchy. The hierarchy can be pre-determined based on user choice, and the number of clusters can remain dynamic rather than being fixed in advance.
● Density-Based Clustering Method: In this method, the density of nearby data points is used to form a cluster; the denser the data inputs, the stronger the resulting cluster. Outliers remain a problem here and are handled separately, for example by classification algorithms such as support vector machines.
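A minimal sketch of the centroid approach using scikit-learn's K-Means; the three "blobs" of points are synthetic, assumed data.

# Minimal sketch: K-Means (a centroid method) with scikit-learn
# on synthetic, assumed data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three assumed "blobs" of 2-D points.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First ten labels:", kmeans.labels_[:10])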
9.14 ANOVA

Formula of ANOVA:
F = MST / MSE
Where, F = ANOVA coefficient
MST = Mean sum of squares due to treatment
MSE = Mean sum of squares due to error
The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an f-test to generate additional data that aligns with the proposed regression models.
(Citations: https://www.investopedia.com/terms/a/anova.asp; https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/)
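A minimal sketch of a one-way ANOVA with scipy.stats.f_oneway; the three groups are synthetic, assumed data.

# Minimal sketch: one-way ANOVA with SciPy on synthetic, assumed groups.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
group1 = rng.normal(60, 5, size=25)
group2 = rng.normal(62, 5, size=25)
group3 = rng.normal(66, 5, size=25)

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# A small p-value suggests at least one group mean differs from the others.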
9.15 PRINCIPAL COMPONENT ANALYSIS (PCA)

PCA is a widely covered method on the web, and there are some great articles about it, but few of them go straight to the point and explain how it works without diving too deeply into the technicalities and the 'why' of things. The aim here is to present it in a simplified way.
(Citation: https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
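A minimal sketch with scikit-learn, reducing five correlated (synthetic, assumed) features to two principal components.

# Minimal sketch: PCA with scikit-learn on synthetic, assumed data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 100 samples with 5 correlated features (assumed for illustration).
base = rng.normal(size=(100, 2))
extra = base @ rng.normal(size=(2, 3)) + rng.normal(scale=0.1, size=(100, 3))
X = np.hstack([base, extra])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)   # (100, 2)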
9.16 DECISION TREES

Decision trees predict the future based on previous learning and input rule sets. A decision tree takes multiple input values and returns the probable output as a single value, which is treated as the decision. The inputs and outputs can be continuous as well as discrete, and the tree makes its decision based on defined algorithms and rule sets.
For example, suppose you want to decide whether to buy a pair of shoes. We start with a few questions:
1. Do we need one?
2. What would be the budget?
3. Formal or informal?
4. Is it for a special occasion?
5. Which colour suits me better?
6. Which would be the most durable brand?
7. Shall we wait for a special sale or just buy one, since it's needed?
Example:
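As an illustrative sketch only, the code below trains a small scikit-learn decision tree on hypothetical shoe-purchase features; the feature names, data, and labels are assumptions, not the source's worked example.

# Minimal sketch: a decision tree on hypothetical shoe-purchase data.
# Features and labels are assumptions for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [need_shoes (0/1), budget_in_rupees, special_occasion (0/1)]
X = [
    [1, 3000, 0],
    [1, 1500, 1],
    [0, 5000, 0],
    [1,  800, 0],
    [0, 2000, 1],
    [1, 4000, 1],
]
# Label: 1 = buy, 0 = do not buy (assumed decisions).
y = [1, 1, 0, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["need_shoes", "budget", "occasion"]))

# Predict for a new case: shoes needed, budget 2500, no special occasion.
print("Decision:", tree.predict([[1, 2500, 0]])[0])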
9.17 SUPPORT VECTOR MACHINES

A support vector machine (SVM) works on the classification of the inputs received, on the basis of a rule set. It also works on regression problems.
Scene one:
Figure 9.3
The above scene shows A, B, and C as three line segments creating hyperplanes that divide the plane. The graph shows two kinds of inputs, circles and stars, which could come from two classes. Looking at the scenario, A is the line segment dividing the plane into two half-planes that separate the two input classes.
Scene two:
Figure 9.4
In scene two we see another rule: the hyperplane that best splits the two halves is chosen. Hence, hyperplane C is the best choice for the algorithm.
Scene three:
Figure 9.5
Here in scene three, we see one circle overlapping hyperplane A; hence, following the rule from scene one, we choose B, which cuts the coordinates into two better halves.
Scene four:
Figure 9.6
Scene four shows one hyperplane dividing the two better halves, but there is one extra circle coordinate in the other half-plane. We call this an outlier, which is generally discarded by the algorithm.
Scene five:
Figure 9.7
Scene five shows another, stranger scenario, where we have coordinates in all four quadrants. In this scenario we fold along the x-axis, cut the y-axis into two halves, and transfer the stars and circles to one side of the quadrant to simplify the solution. The representation is shown below:
Figure 9.8
This again gives us a chance to divide the two classes into two better halves using a hyperplane. In the above scenario, we have scooped the stars out of the circle coordinates and shown them as a different hyperplane.
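As a hedged sketch of the idea, the code below fits a linear support vector machine with scikit-learn to two synthetic, assumed classes (circles vs. stars); a non-linear kernel plays the role of the "folding" described in scene five.

# Minimal sketch: a linear support vector machine with scikit-learn
# on synthetic, assumed two-class data (circles vs. stars).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
circles = rng.normal(loc=(0, 0), scale=0.7, size=(40, 2))
stars   = rng.normal(loc=(3, 3), scale=0.7, size=(40, 2))

X = np.vstack([circles, stars])
y = np.array([0] * 40 + [1] * 40)   # 0 = circle, 1 = star

clf = SVC(kernel="linear").fit(X, y)
print("Hyperplane coefficients:", clf.coef_[0], "intercept:", clf.intercept_[0])
print("Prediction for (1.5, 1.5):", clf.predict([[1.5, 1.5]])[0])

# For data like scene five, a non-linear kernel (e.g., kernel="rbf") performs
# the "folding" of the space so the classes become separable.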
9.26 NEURAL NETWORKS
Artificial Neural Networks: the term is quite fascinating when a student first starts learning it. Let us break down the term and understand its meaning.
Neural comes from the term "neurons" in the brain, a complex structure of nerve cells which keeps the brain functioning. Neurons are the vital part of the human brain, handling everything from simple input/output to complex problem solving.
We all carry roughly 1,200 g of brain (approximately); try weighing it! Jokingly, some say a few of us have kidney beans inside our skulls instead. For those who think they have a tiny little brain, let me share a few facts with you.
Here, the neuron is actually a processing unit; it calculates the weighted sum of the input signals to the neuron to generate the activation signal a, given by:
a = X1·W1 + X2·W2 + ... + Xn·Wn = ∑ Xi·Wi (the sum of the weighted inputs, for i = 1 to n)
where X1, ..., Xn are the input signals and W1, ..., Wn are the corresponding weights.
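A minimal sketch of such a neuron in Python, with assumed inputs, weights, and a simple step activation.

# Minimal sketch: a single artificial neuron computing a weighted sum of its
# inputs and passing it through a step activation (assumed values).
import numpy as np

def neuron(inputs, weights, threshold=0.5):
    # Activation signal: a = sum over i of (Xi * Wi)
    a = np.dot(inputs, weights)
    # Step activation: fire (1) if the weighted sum exceeds the threshold.
    return a, int(a > threshold)

x = np.array([0.2, 0.7, 0.1])   # assumed input signals X1..X3
w = np.array([0.4, 0.6, 0.9])   # assumed weights W1..W3

a, output = neuron(x, w)
print(f"weighted sum a = {a:.2f}, neuron output = {output}")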
9.27 TENSORFLOW
*****
10
ORGANIZE AND REPORT SUPERSTEPS
Unit Structure
10.1 Organize Superstep
10.2 Report Superstep
10.3 Graphics, Pictures
10.4 Unit End Questions
10.1 ORGANIZE SUPERSTEP

Horizontal Style:
Horizontal-style slicing selects a subset of the records; that is, the data science tool can see the complete record for each record in the chosen subset of records.
Vertical Style:
Vertical-style slicing selects a subset of the columns; that is, the data science tool can see only the preselected columns from a record, for all the records in the population.
Island Style:
Island-style slicing combines the horizontal and vertical styles; that is, the data science tool can see only the preselected columns, and only for the records in the chosen subset.
Lift is defined as Lift(x, y) = P(x, y) / (P(x)P(y)). If the two items are statistically independent, then P(x, y) = P(x)P(y), corresponding to Lift = 1. Note that anti-correlation yields lift values less than 1, which is also an interesting discovery, corresponding to mutually exclusive items that rarely co-occur.
You will require the following additional library: conda install -c conda-
forge mlxtend.
The general algorithm used for this is the Apriori algorithm, for frequent item set mining and association rule learning over the content of the data lake. It proceeds by identifying the frequent individual items in the data lake and extending them to larger and larger item sets, as long as those item sets appear sufficiently frequently in the data lake.
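A minimal, hedged sketch of this with mlxtend; the transactions below are assumed example data, not content from the data lake.

# Minimal sketch: Apriori frequent item sets and association rules with mlxtend.
# The transactions below are assumed example data.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent item sets with at least 40% support, then rules ranked by lift.
frequent_sets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_sets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])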
Eliminate Clutter:
And so forth.
Channels of Images
The interesting fact about any picture is that it is a complex data set in its own right. Pictures are built using many layers, or channels, which the visualization tools use to render the required image.
Open your Python editor, and let’s investigate the inner workings of an
image.
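A minimal sketch follows, using a small synthetic RGB array so the example runs without an external file; a real picture could be loaded instead, for instance with matplotlib.image.imread, but any file path would be an assumption.

# Minimal sketch: inspecting the channels of an image.
# A small synthetic RGB image is used so the example runs without a file;
# matplotlib.image.imread("your_image.png") could load a real picture instead.
import numpy as np

height, width = 4, 6
image = np.zeros((height, width, 3), dtype=np.uint8)
image[..., 0] = 255                                      # red channel fully on
image[..., 1] = 128                                      # green channel at half intensity
image[..., 2] = np.arange(width, dtype=np.uint8) * 40    # blue gradient across columns

print("Image shape (rows, columns, channels):", image.shape)
for i, name in enumerate(["Red", "Green", "Blue"]):
    channel = image[..., i]
    print(f"{name} channel: min={channel.min()}, max={channel.max()}")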
*****