Exam-1
Predictive Analytics:
Predictive analytics uses statistical models and machine learning algorithms to analyze historical data
and predict future outcomes. It answers the question "What could happen?" by identifying trends, risks,
or opportunities. This is often used in forecasting and risk assessment, enabling organizations to
anticipate changes and make informed decisions. Example: Predicting customer churn based on past
behavior.
Prescriptive Analytics:
Prescriptive analytics goes beyond predictions by suggesting actionable steps to achieve desired
outcomes. It answers "What should be done?" by applying optimization algorithms and simulations. This
type of analytics helps organizations make data-driven decisions, optimizing processes or strategies to
meet specific goals. Example: Recommending the most efficient route for delivery trucks in a logistics
network.
The CRISP-DM (Cross Industry Standard Process for Data Mining) methodology is a widely used
framework for data mining and analytics projects. It consists of six major steps, each of which guides the
process from understanding the business problem to deploying a final solution. The steps are:
1. Business Understanding:
Goal: Clearly define the project’s objectives and requirements from a business perspective.
Activities: Understand the business problem, translate it into a data science problem, and outline the
success criteria.
2. Data Understanding:
Goal: Collect, assess, and explore the data to get familiar with it.
Activities: Initial data collection, data quality checks, descriptive statistics, and exploratory data analysis.
3. Data Preparation:
Goal: Construct the final dataset that will be fed into the modeling step.
Activities: Data cleaning, transformation, feature selection, and creating the final dataset that will be
used for modeling.
4. Modeling:
Goal: Apply various modeling techniques to the prepared data to address the business problem.
Activities: Select modeling techniques, build models, tune model parameters, and validate model
performance.
5. Evaluation:
Goal: Assess the models to ensure they meet business objectives and provide meaningful results.
Activities: Evaluate the models based on performance metrics, review the steps taken, and verify if the
results are valid and useful.
Output: Evaluation report and decision on whether the model is ready for deployment.
6. Deployment:
Goal: Implement the results of the data mining process into the business.
Activities: Deploy the model or insights into production, create reporting systems, monitor performance,
and provide maintenance if needed.
Output: Final model or solution in use, possibly with documentation and reports for the end-users.
These steps are iterative, meaning that after completing one step, you may return to previous steps to
refine and improve your results. The flexibility of CRISP-DM makes it suitable for a variety of industries
and data-related projects.
• 3 Vs
• Volume: massive amounts of data that traditional tools struggle to store and process.
• Velocity: data arriving at high speed, requiring quick response times to derive insights and make decisions.
• Variety: data in many formats, from structured tables to unstructured text, images, and video.
Structured/Unstructured
Structured Data:
o Definition: Data that is highly organized and formatted in a consistent way, making it
easy to store, search, and analyze. It typically resides in databases or spreadsheets.
o Examples: Customer databases, sales records, sensor data, data in relational databases.
o Use: Structured data can be easily processed using SQL queries, statistical methods, or
machine learning algorithms.
o Key Characteristic: Data is organized in rows and columns with predefined formats (e.g.,
numbers, strings, dates).
Unstructured Data:
o Definition: Data with no predefined structure or consistent format, such as free text or media files, which makes it harder to store and analyze with traditional tools.
o Examples: Text documents, emails, social media posts, images, audio, and video files.
o Use: Requires specialized techniques like text mining, natural language processing (NLP),
or machine learning to extract insights.
o Key Characteristic: Lacks a predefined format, and the data is more complex to
interpret.
Nominal Variables:
o Definition: Variables that represent categories with no inherent order or ranking.
o Examples: Gender (male/female), types of fruits (apple, banana, orange), colors (red, blue, green).
Ordinal Variables:
o Definition: Variables that represent categories with a meaningful order or ranking, but
the intervals between values are not necessarily consistent or measurable.
o Examples: Education level (high school, bachelor’s, master’s), customer satisfaction (low,
medium, high), ranks in a race (1st, 2nd, 3rd).
o Key Characteristic: There is an inherent order, but differences between the categories
cannot be quantified.
Interval Variables:
o Definition: Variables that have meaningful, equal intervals between values, but no true zero point.
o Examples: Temperature in Celsius or Fahrenheit, calendar years.
o Key Characteristic: Differences between values are meaningful, but there is no absolute zero (e.g., 0°C does not mean the absence of temperature).
Ratio Variables:
o Definition: Variables that have meaningful intervals and a true zero point, which allows for both comparison and multiplication/division of values.
o Examples: Height, weight, income, number of purchases.
o Key Characteristic: True zero allows for meaningful comparisons such as “twice as much” or “half as much.”
***Key Differences:
Nominal vs. Ordinal: Nominal has no inherent order, while ordinal has a ranked order.
Interval vs. Ratio: Interval lacks a true zero point, while ratio variables have a true zero, making
ratios between values meaningful.
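A quick sketch of how nominal vs. ordinal variables might be represented in Python with pandas (the column names and category levels here are made up purely for illustration):
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange", "apple"],      # nominal: no order
    "satisfaction": ["low", "high", "medium", "high"],    # ordinal: ranked
})

# Nominal variable: categories with no inherent order
df["fruit"] = pd.Categorical(df["fruit"])

# Ordinal variable: categories with a meaningful order
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df["satisfaction"].min())   # "low" -- ordering is meaningful
print(df["fruit"].dtype)          # category (unordered)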
***Moving Average, Weighted Moving Average, and Exponential Smoothing
Moving Average:
Definition: A Moving Average Model is a method used to smooth out data by averaging a set number of past values. It helps reduce short-term fluctuations and shows the overall trend.
Example: If you're looking at stock prices over 5 days, you average the prices of the last 5
days to get a smoother value for the 6th day.
Weighted Moving Average:
Definition: A Weighted Moving Average is similar to a regular moving average, but more recent values are given more importance (or weight). Instead of treating all past values equally, it assigns bigger weights to more recent data points.
Example: In a 5-day stock price example, you give more importance to the latest prices
(like day 5 and 4) than to older ones (day 1, 2, 3).
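A minimal Python sketch of both averages (the price series and weights are made-up numbers):
import statistics

prices = [100, 102, 101, 105, 107]     # last 5 days of a hypothetical stock price

# Simple moving average: every day counts equally
simple_ma = statistics.mean(prices)    # (100+102+101+105+107)/5 = 103.0

# Weighted moving average: more recent days get bigger weights
weights = [0.1, 0.15, 0.2, 0.25, 0.3]  # weights sum to 1; the newest day weighted most
weighted_ma = sum(p * w for p, w in zip(prices, weights))

print(simple_ma, weighted_ma)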
Exponential Smoothing:
Definition: Exponential Smoothing is a forecasting method that gives more importance to recent data points while still considering older data. It does this with a smoothing constant α (a number between 0 and 1) that decides how much weight to put on recent data; α can also be read as the fraction of the forecast error that is carried into the next forecast.
Formula: New forecast = previous forecast + α × (actual demand − previous forecast).
Example: Suppose the previous forecast was 42 units, actual demand came to be 40, and α is 0.1. The new forecast is 42 + 0.1(40 − 42) = 41.8. If the next period's actual demand is 43, the following forecast is 41.8 + 0.1(43 − 41.8) = 41.92.
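The same worked example as a short Python sketch:
def exp_smooth(prev_forecast, actual, alpha=0.1):
    # New forecast = previous forecast + alpha * (actual - previous forecast)
    return prev_forecast + alpha * (actual - prev_forecast)

f1 = exp_smooth(42, 40)   # 42 + 0.1*(40 - 42) = 41.8
f2 = exp_smooth(f1, 43)   # 41.8 + 0.1*(43 - 41.8) = 41.92
print(f1, f2)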
1. Adjacency Matrix
Definition: An adjacency matrix is a 2D array (or matrix) used to represent a graph. Each
element in the matrix indicates whether there is an edge between a pair of nodes.
Properties:
o For an undirected network, the adjacency matrix is symmetric.
o How many edges are there in a directed network? Sum the entire matrix (not just half of it), since each entry is a distinct directed edge. In an undirected network each edge appears twice, so sum the matrix and divide by two.
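A small Python sketch of the edge-count rule, using a made-up 3-node graph (rows/columns assumed to be in order A, B, C):
# Adjacency matrix for an undirected triangle A-B, A-C, B-C
undirected = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
total = sum(sum(row) for row in undirected)
print(total // 2)   # undirected: each edge counted twice -> 3 edges

# For a directed network, every 1 is its own edge, so sum the entire matrix
directed = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
print(sum(sum(row) for row in directed))   # 3 directed edges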
2. Edge List
Definition: An edge list is a collection of all the edges in the graph, where each edge is
represented as a pair (or tuple) of nodes. If the graph is weighted, the list may include
the weights of the edges as well.
Properties:
o Each row in the list shows two nodes that are connected and optionally the
weight.
o This representation is more space-efficient for sparse graphs (graphs with fewer
edges).
Example (for the same triangle graph A-B, A-C, B-C):
(A, B)
(A, C)
(B, C)
3. Adjacency List
Definition: An adjacency list stores, for each node, the list of nodes it is directly connected to.
Properties:
o Suitable for sparse graphs because it only stores the neighbors of each node.
o For directed graphs, the adjacency list indicates only outgoing edges.
Example:
A -> [B, C]
B -> [A, C]
C -> [A, B]
Comparison:
Adjacency Matrix: Best for dense graphs (graphs with lots of edges). It allows quick
lookups for whether two nodes are connected, but it uses O(n^2) space, even for sparse
graphs.
Edge List: Space-efficient for sparse graphs. Finding if two nodes are connected may
require checking the entire list.
Adjacency List: Combines the advantages of both adjacency matrices and edge lists, and
is typically more space-efficient for sparse graphs while allowing quick access to
neighbors.
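A Python sketch showing the same small triangle graph (A-B, A-C, B-C) as an edge list, an adjacency matrix, and an adjacency list:
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]   # edge list

# Adjacency matrix (rows/columns follow the order of `nodes`)
idx = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[idx[u]][idx[v]] = 1
    matrix[idx[v]][idx[u]] = 1   # undirected: mirror the entry

# Adjacency list built from the edge list
adj = {n: [] for n in nodes}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(matrix)   # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(adj)      # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}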
Directed/Undirected Network: in a directed network each edge has a direction (A → B is not the same as B → A); in an undirected network edges have no direction.
Small World Phenomenon: a network exhibits the small-world phenomenon when its nodes have high clustering coefficients and, at the same time, any node can be reached from any other in only a few steps.
Power law / Scale-free network: if the degree distribution follows a power law (also called a long-tail distribution), the network is a scale-free network: a few hub nodes have very many connections while most nodes have few.
1. Random Network:
A network where connections between nodes are made randomly, and the degree
distribution of nodes follows a Poisson distribution, meaning most nodes have a similar
number of connections.
2. Small-World Network:
A network characteristic where most nodes can be reached from any other node in just
a few steps, despite the network being large. It combines short path lengths with high
clustering of nodes.
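A sketch comparing the two models with networkx (the sizes and probabilities are arbitrary choices for illustration); the small-world (Watts-Strogatz) graph should show much higher average clustering than the random (Erdős-Rényi) graph at a similar average path length:
import networkx as nx

random_g = nx.erdos_renyi_graph(n=200, p=0.05, seed=42)              # random: Poisson degree distribution
small_world = nx.watts_strogatz_graph(n=200, k=10, p=0.05, seed=42)  # small-world model

for name, g in [("random", random_g), ("small-world", small_world)]:
    clustering = nx.average_clustering(g)
    # average path length is only defined if the graph is connected
    path_len = nx.average_shortest_path_length(g) if nx.is_connected(g) else float("nan")
    print(name, round(clustering, 3), round(path_len, 2))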
Clustering coefficient: how clustered a node's neighborhood is, i.e., how well the node's neighbors are also directly connected to each other.
For a node with n neighbors, the maximum number of connections among those neighbors is n(n-1)/2.
Clustering coefficient = (actual connections among the neighbors) / (n(n-1)/2).
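A minimal Python sketch of the local clustering coefficient computed straight from this definition, on a small made-up graph:
from itertools import combinations

# Hypothetical undirected graph as an adjacency list
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def clustering(node):
    neighbors = adj[node]
    n = len(neighbors)
    if n < 2:
        return 0.0
    max_links = n * (n - 1) / 2   # maximum possible neighbor-neighbor links
    actual = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return actual / max_links

print(clustering("A"))   # neighbors B, C, D; only B-C exists -> 1/3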
• Network Properties/Centralities
• Degree centrality (node-level property) and average degree: degree centrality counts how many edges a node has; in a directed network this splits into in-degree (incoming edges) and out-degree (outgoing edges).
Average Degree:
Definition: Average Degree refers to the average number of connections (edges) each node has in a network. It gives an idea of how connected the network is overall. For an undirected network, average degree = 2E/N, where E is the number of edges and N the number of nodes.
Example: If a network has 10 nodes and 30 edges, the average degree is (2 × 30) / 10 = 6, meaning each node has an average of 6 connections.
• Degree Distribution (network-level property)
Describes how the degrees are spread across the network, i.e., what fraction of nodes has each degree.
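A short Python sketch computing average degree and degree distribution from an edge list (a tiny made-up undirected graph; the 10-node/30-edge example would work the same way):
from collections import Counter

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]   # small undirected example

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n_nodes = len(degree)
avg_degree = 2 * len(edges) / n_nodes   # undirected: average degree = 2E / N
print(avg_degree)                       # 2*4/4 = 2.0

# Degree distribution: how many nodes have each degree
distribution = Counter(degree.values())
print(distribution)                     # Counter({2: 2, 3: 1, 1: 1})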
• Closeness***: how close a node is to all other nodes, based on the shortest-path distances from that node to every other node (the closer on average, the higher the closeness centrality).
• Betweenness***: how many times a node lies on the shortest paths between all other pairs of nodes.
• Cliques***: a subset of the network in which every node is directly connected to every other node.
• Connected Components***: maximal groups of nodes in which every node can reach every other node through some path.
• Structural holes***: gaps between groups of nodes that are not directly connected; a node bridging such a gap can broker information between the groups.
• Bipartite networks***: nodes of the same kind are never connected to each other; they are connected only through nodes of the other kind (e.g., people linked through the events they attend).
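A sketch computing several of these properties with networkx on a tiny made-up graph:
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.degree_centrality(G))           # degree-based centrality per node
print(nx.closeness_centrality(G))        # closeness: based on shortest-path distances
print(nx.betweenness_centrality(G))      # betweenness: shortest paths passing through each node
print(list(nx.find_cliques(G)))          # maximal cliques, e.g. the A-B-C triangle
print(list(nx.connected_components(G)))  # connected components (one here)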