Intro To DA, Data Sources and Representation
UE21CS342AA2
UNIT-1
Lecture 1: Introduction to Data Analytics, Data Sources and Representations
Department of Computer Science and Engineering
DATA ANALYTICS
Evaluation Policy (tentative)
Component    Description    Weight
ISA 1    CBT, 40 marks scaled to 20 (Unit 1 and Unit 2)    20
ISA 2    CBT, 40 marks scaled to 20 (Unit 3 and Unit 4)    20
Experiential Learning: Worksheets (Lab + Assignment) + Hackathon
    • Worksheets/Assignments (1 per unit, with multiple parts)    5
    • Hackathon (1 for the entire course): 5 marks
      2 marks for participation + 2 marks for working
      + 1 mark for finishing in the top 30% of the leaderboard: 2 marks (for those
      not in the top 30%, class participation (attendance + questions asked,
      answered) in the lecture hours + invited talk can count towards the 1 mark)
6 marks: hackathon
2 marks - timely submission that follows instructions
2 marks - working (includes their similarity score using MOSS tool with a reasonable threshold)
(for those who get a high similarity score, a chance can be given through a viva-voce;
if they have understood what they submitted and can answer a few questions, they can
get these 2 marks)
2 marks - for being in the top 30% of the leader board
(for those who miss this any additional assignment/ task can be defined by each faculty
for their class to allow students to earn those points; a case study presentation has been
included in the slides as an example only)
About ISA/ESA: a reference list of questions for each unit, from which questions will be
asked.
DATA ANALYTICS
What is data analytics?
⚫ The science of examining raw data to elicit patterns, develop insights, and
draw conclusions that support business decisions.
⚫ The need:
Business decisions are very complex: there are several alternative solutions,
complex interdependent factors, and little time available to take a decision.
⚫ Analysis vs analytics
⚫ Analysis – Examining and understanding past data.
⚫ Analytics – Analysis + forecasting (or predictive modeling).
DATA ANALYTICS
Pyramid of analytics applications and Data driven decision making
DATA ANALYTICS
What does it involve?
DATA ANALYTICS
Why is it important?
… and more!
DATA ANALYTICS
Few examples
And a secret ingredient: intuition or deductive reasoning, and domain knowledge
DATA ANALYTICS
Case Study
Technology:
To find out whether a customer has forgotten to place an order
for an item
The objective of the data science component of analytics is to identify the
statistical model or machine learning algorithm that performs best on a chosen
measure of accuracy.
DATA ANALYTICS
How large is big (data)?
• 1 bit
• 1 byte = 8 bits
• 1 KB = 1024 bytes
• 1 MB = 1024 KB (kilobytes)
• 1 GB = 1024 MB (megabytes)
• 1 TB = 1024 GB (gigabytes) ≈ 10^12 bytes
• 1 PB = 1024 TB (terabytes) ≈ 10^15 bytes
20 PB = amount of data processed by Google per day!
• 1 EB = 1024 PB (petabytes)
• 1 ZB = 1024 EB (exabytes)
• 1 YB = 1024 ZB (zettabytes)
• What is a Domegemegrottebyte?
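The ladder of units above can be checked with a few lines of Python. This is only a sketch: the list `UNITS` and the helper `unit_in_bytes` are names made up for illustration, and the printout simply shows how close each power of 1024 is to a power of ten.

```python
# Binary storage units from the slide: each step is 1024 times the previous one.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def unit_in_bytes(unit):
    """Return the number of bytes in one `unit` (1 KB = 1024 bytes, and so on)."""
    power = UNITS.index(unit) + 1
    return 1024 ** power

for unit in UNITS:
    size = unit_in_bytes(unit)
    # Show the exact value next to its decimal order of magnitude.
    print(f"1 {unit} = {size:,} bytes (about {size:.2e})")
```

The exact value of 1 TB is slightly larger than 10^12 bytes, which is why the slide uses ≈ rather than =.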
DATA ANALYTICS
Data Sources
characteristic of an object.
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable,
vector
Objects:
1  Yes  Single   125K  No
2  No   Married  100K  No
3  No   Single   70K   No
■ Transaction data
■ Molecular Structures
DATA ANALYTICS
Data Representations
■ Ordered
■ Video data: sequence of images
sequences
■ Genetic sequence data
■ Image data
■ Video data
DATA ANALYTICS
Data Representations-Record Data
■ Data that consists of a collection of records, each of which
consists of a fixed set of attributes
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
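A document-term table like the one above is record data: every document is described by the same fixed set of attributes (the terms). A minimal Python sketch of how such a table could be built follows; the function name `document_term_matrix`, the vocabulary and the two sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter

def document_term_matrix(docs, vocabulary):
    """Count how often each vocabulary term occurs in each document."""
    matrix = []
    for text in docs:
        counts = Counter(text.lower().split())
        # A Counter returns 0 for terms that never occur in the document.
        matrix.append([counts[term] for term in vocabulary])
    return matrix

# Hypothetical mini-corpus; the sentences are made up for illustration.
vocabulary = ["team", "coach", "score", "game"]
docs = [
    "the team won the game and the team celebrated",
    "the coach praised the score",
]
matrix = document_term_matrix(docs, vocabulary)
for row in matrix:
    print(row)
```

Each row has exactly one entry per vocabulary term, which is what makes the result a fixed-attribute record.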
DATA ANALYTICS
Data Representations-Transaction data
■ A special type of record data, where
■ each record (transaction) involves a set of items.
TID Items
1 Bread,Coke,Milk
2 Beer,Bread
3 Beer,Coke,Diaper,Milk
4 Beer,Bread,Diaper,Milk
5 Coke,Diaper,Milk
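One way to see why transaction data is a special type of record data is to turn each transaction into a row of 0/1 attributes, one per item. The sketch below uses the five transactions from the table; the variable names are illustrative, and the binary encoding is one possible representation rather than the only one.

```python
# Transactions copied from the table: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# The fixed attribute set is every distinct item, in sorted order.
items = sorted(set().union(*transactions.values()))

# One row per transaction, one 0/1 column per item.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print("TID", items)
for tid, row in binary_rows.items():
    print(tid, row)
```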
DATA ANALYTICS
Data Representations
■ Graph Data
■ Examples: Generic graph and HTML Links
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
■ Chemical Data
■ Benzene Molecule: C6H6
DATA ANALYTICS
Data Representations-Ordered Data
■ Sequences of transactions ■ Spatio-Temporal Data
Items/Events
• Which OLAP operation are you likely to perform at the end of the financial
year?
Roll-Up
• https://www.geeksforgeeks.org/data-cube-or-olap-approach-in-data-mining/
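Roll-up aggregates data by climbing a concept hierarchy, for example from months up to a year, which is why it suits year-end reporting. A minimal Python sketch follows; the sales figures and the function name `roll_up_to_year` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical monthly sales facts: (year, month, amount). Values are invented.
sales = [
    (2022, "Jan", 120), (2022, "Feb", 80), (2022, "Mar", 100),
    (2023, "Jan", 150), (2023, "Feb", 90),
]

def roll_up_to_year(facts):
    """Aggregate month-level amounts up the time hierarchy to year level."""
    totals = defaultdict(int)
    for year, _month, amount in facts:
        totals[year] += amount
    return dict(totals)

yearly = roll_up_to_year(sales)
print(yearly)
```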
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 2 : The R programming
environment and descriptive statistics
• R also has charting capabilities, which means you can plot your
data and create interesting visualizations from any dataset.
DATA ANALYTICS
Applications of R
• R has been used primarily in academics and research and is great for
exploratory data analysis.
packages.
dplyr.
DATA ANALYTICS
Preference by Industry
DATA ANALYTICS
History of R
• The S language has been developed since the late 1970s by John
Chambers and colleagues at Bell Labs as a language for programming
with data.
• The S language combines ideas from a variety of sources (awk, Lisp, APL) and
provides an environment for quantitative computations and
visualization.
• It provides an explicit and consistent structure for manipulating, statistically
analyzing and visualizing data.
• S-Plus is a commercialization of the Bell Labs framework. It is "S" plus
"graphics".
• R is a free implementation of a dialect of the S language.
DATA ANALYTICS
History of R
• CRAN also hosts many add-on packages that can be used to extend the
functionality of R. Over 6,789 packages are available on CRAN that have
been developed by users and programmers around the world.
DATA ANALYTICS
Getting Started
Installing R on Windows:
• To install R on your Windows computer, follow the steps below:
• Go to https://cran.rstudio.com/bin/windows/base/
• Alternative: http://ftp.heanet.ie/mirrors/cran.r-project.org
• Under “Download and Install R”, click on the “Windows” link.
• Download the required .exe file. The latest version of R is 4.3.1. Click on the
link which says “Download R 4.3.1 for Windows.”
• After downloading, double click on the executable to run it.
• Choose English as the language to install it.
DATA ANALYTICS
Download and install RStudio on Windows
Visit https://www.rstudio.com/products/rstudio/download/
and click on DOWNLOAD RSTUDIO DESKTOP
List of IDEs
■ RStudio
■ StatET for R (eclipse based)
■ R-Brain IDE (RIDE)
■ IntelliJ IDEA
■ R Tools for Visual Studio
DATA ANALYTICS
Setting your Working Directory
R getwd() Function
• Working directory is the directory where R looks for files to read and
where it writes files by default.
• getwd() function returns an absolute file path representing the
current working directory of the R process.
• getwd() Output: "C:/Users/*****/Documents"
R setwd() Function
• setwd(dir) is used to set the working directory to dir.
• setwd("e:/folder") Output: Sets the working directory to "e:/folder"
DATA ANALYTICS
Setting your Working Directory
R dir() Function
• dir() function lists all the files in a directory.
• dir() Output: Lists all files in the current working directory
R ls() Function
• ls() is a function in R that lists all the
objects in the working environment.
• ls() Output: Lists all objects in the current environment (the workspace)
• It is useful when you want to clean the environment before running
code. The following command will remove all the objects
from the R environment.
• rm(list = ls())
DATA ANALYTICS
Getting help with R
• data()
• library(ggplot2)
• After loading a package, the functions exported by that package will
be attached to the top of the search() list (after the workspace).
• library(ggplot2)
• search()
• To save your workspace to a file, you may type save.image() or use
Save Workspace in the File menu
• The default workspace file is called .RData
Data Analytics
Unit 1
Descriptive Statistics
DATA ANALYTICS
Revisiting types of data attributes
Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e.,
  nominal attributes provide only enough information to distinguish one object
  from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test
Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to
  order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests
Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
  length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
DATA ANALYTICS
How is Interval different to Ratio?
https://www.questionpro.com/blog/ratio-scale-vs-interval-scale/
DATA ANALYTICS
Transformations on Data
With this data, we can conclude that both classes are equally smart as
their average IQs are roughly the same.
Mean is the arithmetic average of the data and is one of the
most used measures of central tendency.
If the data is captured in frequencies, then use:
Unlike the mean, the median is not calculated from every value in the dataset; we
simply look for the midpoint of the sorted data.
However, it is more stable than the mean, as adding a new observation doesn’t
change the median significantly.
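The behaviour of the mean and the median can be illustrated with a short Python sketch. The scores are hypothetical, and `frequency_mean` is an assumed helper that uses the standard form for grouped data, sum(f_i * x_i) / sum(f_i), where f_i is the frequency of value x_i.

```python
from statistics import mean, median

def frequency_mean(values, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / sum(frequencies)

# Hypothetical IQ scores for one class; 190 is an added extreme value.
scores = [95, 100, 100, 105, 110]
with_outlier = scores + [190]

print(mean(scores), median(scores))
print(mean(with_outlier), median(with_outlier))

# The same five scores stored as distinct values with their frequencies.
print(frequency_mean([95, 100, 105, 110], [1, 2, 1, 1]))
```

Adding the single extreme score moves the mean far more than the median, which is the stability the slide refers to.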
DATA ANALYTICS
Mode
• For example, assume that there is student data with students’ mode of
transport, namely car, bus, two wheeler or the metro. Mean and median are
meaningless for nominal data such as mode of transport.
• It is possible for a dataset to not have a mode at all. This occurs when each value
of the dataset appears an equal number of times.
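A small Python sketch of the mode for nominal data follows. The function `modes` and the survey answers are hypothetical. Returning an empty list when every distinct value occurs equally often is a design choice that mirrors the "no mode" case described above.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value occurs equally often, the data has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]

# Hypothetical survey answers about mode of transport.
transport = ["bus", "metro", "bus", "car", "two wheeler", "bus", "metro"]
print(modes(transport))
print(modes(["car", "bus", "metro"]))
```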
DATA ANALYTICS
Test your understanding!
• Which central tendency (when applicable and exists) provides a value which
is present in the dataset?
Mode
• Which central tendency is not robust to outliers?
Mean
• When the dataset contains outliers, which measure of central tendency is
the most appropriate to use?
Median
• Which function in R is used to load a package?
library()
DATA ANALYTICS
⦁ Quartiles divide the sorted data into 4 equal parts. The first quartile (Q1)
is the value below which the first 25% of the data lies, Q2 is the value below
which 50% of the data lies and is also the median, and the third quartile (Q3)
is the value below which 75% of the data lies.
DATA ANALYTICS
Example
2. Note that the data in the table is arranged in increasing order, column by column.
The position of P10 = 10 × (51)/100 = 5.1. We can round 5.1 to its
nearest integer, 5. The corresponding value from the table is 21. That is, 10% of
the observations have a value less than or equal to 21; by 21
hours, 10% of the wire-cuts will fail.
Instead of rounding the position obtained from the equation, we can use the
following approximation: the value at the 5th position is 21, so the value at position 5.1 is
approximated as 21 + (0.1 × (value at 6th position − value at 5th position))
= 21 + (0.1 × 1) = 21.1
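The position-based calculation in this example can be sketched in Python as below. The function name `percentile` is an assumption, and because the original table is not reproduced here, `data` is a hypothetical stand-in: 50 sorted values whose 5th and 6th entries are 21 and 22, matching the numbers used in the example.

```python
def percentile(sorted_values, p):
    """Value at the p-th percentile using position p * (n + 1) / 100
    and linear interpolation between neighbouring observations."""
    n = len(sorted_values)
    position = p * (n + 1) / 100          # 1-based, may be fractional
    lower = int(position)                 # integer part of the position
    fraction = position - lower
    if lower < 1:                         # before the first observation
        return sorted_values[0]
    if lower >= n:                        # at or beyond the last observation
        return sorted_values[-1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

# Hypothetical stand-in for the table: 50 sorted values, 5th = 21, 6th = 22.
data = list(range(17, 67))
p10 = percentile(data, 10)
print(round(p10, 2))
```

With n = 50 the position is 5.1, so the result lies one tenth of the way from the 5th value to the 6th.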
DATA ANALYTICS
Solution
• Solution:
• The value of $\frac{\sqrt{n(n-1)}}{n-2}$ will converge to 1 as the value of n
increases.
• Kurtosis = $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4}$
• Excess Kurtosis = $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3$
• The value of is
a) Zero for any sample
b) Zero for population but not necessarily for samples
c) Zero for both samples and population
d) Cannot say
Zero for both samples and population
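The kurtosis and excess kurtosis formulas on this slide can be sketched in Python as follows. The sketch assumes that σ is the population standard deviation (dividing by n) and that the sum runs over all n observations. The function names and the sample values are made up for illustration.

```python
def kurtosis(values):
    """Kurtosis: mean fourth-power deviation from the mean, divided by sigma^4."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sigma^2 is the population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance ** 2

def excess_kurtosis(values):
    """Kurtosis minus 3, the kurtosis of a normal distribution."""
    return kurtosis(values) - 3

data = [1, 2, 3, 4, 5]   # made-up values
print(kurtosis(data), excess_kurtosis(data))
```

A negative excess kurtosis, as for these evenly spread values, indicates a flatter shape than the normal distribution.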
3) Consistency: The data should not contain any discrepancies in the values or in the naming
convention of the attributes.
Examples of inconsistency:
• Age is recorded as 50 but Date of Birth = 03/04/2005.
• In the result column of students’ marks, a few entries are in GPA format
and the rest in percentage.
• Discrepancies can exist between duplicate records.
DATA ANALYTICS
Measures of Data Quality
4) Timeliness: The data must be updated in a timely fashion. For example, for
an analysis run on the first day of every month, the previous month’s data must
be up to date for accurate analysis.
5) Believability: The data and its source must be trusted by the users. If this
data or the source caused problems in the past, current users will find it
hard to trust it.
NOTE: The quality of data is subjective and depends on the intended use of
the data. The data needs of each problem are different.
DATA ANALYTICS
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
DATA ANALYTICS
Data Cleaning
• Equipment malfunction
• Inconsistent with other recorded data leading to its deletion
• Data not recorded due to a misunderstanding
• Certain data may not be considered useful at the time of entry
NOTE: There is no statistical way to determine which category your missing data
falls under.
DATA ANALYTICS
Types of Missing Values-A Quick Glance
Missing at Random, MAR, means there is a systematic relationship between the propensity
of missing values and the observed data, but not the missing data.
Whether an observation is missing has nothing to do with the missing values, but it does
have to do with the values of an individual’s observed variables. So, for example, if men are
more likely to tell you their weight than women, weight is MAR.
Missing Not at Random, MNAR, means there is a relationship between the propensity of a
value to be missing and its values.
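The weight example can be made concrete by comparing how often a value is missing within each group of an observed variable. The sketch below uses hypothetical survey records and an assumed helper, `missing_rate_by`. A difference between groups is what MAR would look like, but it cannot by itself rule out MNAR, because the unreported values are never seen.

```python
# Hypothetical survey records; None marks a weight that was not reported.
records = [
    {"sex": "M", "weight": 72}, {"sex": "M", "weight": 80},
    {"sex": "M", "weight": None}, {"sex": "M", "weight": 65},
    {"sex": "F", "weight": None}, {"sex": "F", "weight": 58},
    {"sex": "F", "weight": None}, {"sex": "F", "weight": None},
]

def missing_rate_by(rows, group_key, value_key):
    """Share of missing `value_key` entries within each `group_key` group."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        if row[value_key] is None:
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}

rates = missing_rate_by(records, "sex", "weight")
print(rates)
```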
DATA ANALYTICS
An Interesting Thought
• Imagine you are collecting some information from your classmates. For many
reasons, not everyone will answer every question of yours. And that is okay!
• The next step is replacing the missing values, right? We can use any one of
the methods we have discussed so far, after some analysis of the data.
• But wait! Don’t you think the fact that they did not answer is itself a kind of
information that can be beneficial to our analysis?
• So the next time you build a model, before dealing with the missing values,
create an additional variable (preferably a binary variable) in which you
store whether the particular student answered or not.
• This may (or may not!) help you gain more insights about the population or
improve the analytics model you are building!
https://youtu.be/f9AQy7p0QEo
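A minimal Python sketch of the idea above: record whether each student answered before the gaps are filled. The function `add_missing_indicator`, the column names and the answers are hypothetical, and mean imputation is used here only as one possible way of replacing the missing values.

```python
def add_missing_indicator(rows, column):
    """Add a 0/1 column recording whether `column` was answered,
    then fill the gaps with the mean of the observed answers."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = sum(observed) / len(observed)
    result = []
    for r in rows:
        new_row = dict(r)                       # leave the input untouched
        new_row[f"{column}_answered"] = 0 if r[column] is None else 1
        if r[column] is None:
            new_row[column] = fill              # mean imputation (one option)
        result.append(new_row)
    return result

# Hypothetical classmates' answers; None means the question was skipped.
answers = [
    {"name": "A", "hours_of_sleep": 7},
    {"name": "B", "hours_of_sleep": None},
    {"name": "C", "hours_of_sleep": 5},
]
completed = add_missing_indicator(answers, "hours_of_sleep")
for row in completed:
    print(row)
```

The indicator column keeps the "did not answer" information available to the model even after the original gap has been filled.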
DATA ANALYTICS
Noisy data
Noise is a random error or variance in a measured variable.
Data smoothing techniques to combat noise:
• Binning
▪ Sort the data and partition it into bins (equal-width, equal-frequency, etc.)
▪ Smooth by bin means, bin medians, bin boundaries, etc.
▪ More on binning in further lectures.
• Regression - Data can be smoothed by fitting it to a regression model.
• Clustering - Outliers can be detected with the help of clustering and can be
removed to smooth the data.
• Combined computer and human inspection - The computer detects suspicious
values, which are then validated by a human. This is useful when dealing with possible
outliers.
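The binning steps listed above can be sketched in Python as follows. This is an illustrative sketch only: the function names are assumptions, the sample values are chosen for illustration, and only equal-frequency bins with smoothing by means and by boundaries are shown.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]           # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Sample values chosen for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means pulls every value to the centre of its bin, while smoothing by boundaries keeps the bin's extreme values and snaps the rest to the nearer one.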
DATA ANALYTICS
Outliers
Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set.
• Data transformation
• Some data inconsistencies can be corrected manually but most errors require
data transformations.
• Data migration tools allow transformations to be specified.
• ETL (Extraction/Transformation/Loading) tools allow users to specify
transformations through a graphical user interface.
• Data transformations may introduce more discrepancies.
• The 2-step process of discrepancy detection and data transformation occurs
iteratively until no further anomalies are found.
• New approaches to data cleaning emphasize increased interactivity. Potter’s
Wheel is a publicly available data cleaning tool that integrates both steps.
DATA ANALYTICS
Test your understanding!
• Which of these is not a method to deal with noisy data?
a) Binning
b) Regression
c) Principal Component Analysis
d) Clustering
Solution
c) Principal Component Analysis
• True or false: Outliers need to be removed in every dataset, regardless of the problem
statement.
Solution
False
• Mean imputation can be done for which type of missing data?
Solution
MCAR
DATA ANALYTICS
Test your understanding!
• The statement “Most of the people missing from work are the sickest people” denotes
what type of missingness?
MNAR