Exam-1
Predictive Analytics:
Predictive analytics uses statistical models and machine learning algorithms to analyze historical data
and predict future outcomes. It answers the question "What could happen?" by identifying trends, risks,
or opportunities. This is often used in forecasting and risk assessment, enabling organizations to
anticipate changes and make informed decisions. Example: Predicting customer churn based on past
behavior.
Prescriptive Analytics:
Prescriptive analytics goes beyond predictions by suggesting actionable steps to achieve desired
outcomes. It answers "What should be done?" by applying optimization algorithms and simulations. This
type of analytics helps organizations make data-driven decisions, optimizing processes or strategies to
meet specific goals. Example: Recommending the most efficient route for delivery trucks in a logistics
network.
The CRISP-DM (Cross Industry Standard Process for Data Mining) methodology is a widely used
framework for data mining and analytics projects. It consists of six major steps, each of which guides the
process from understanding the business problem to deploying a final solution. The steps are:
1. Business Understanding:
Goal: Clearly define the project’s objectives and requirements from a business perspective.
Activities: Understand the business problem, translate it into a data science problem, and outline the
success criteria.
2. Data Understanding:
Goal: Collect, assess, and explore the data to get familiar with it.
Activities: Initial data collection, data quality checks, descriptive statistics, and exploratory data analysis.
3. Data Preparation:
Goal: Construct the final dataset that will be fed into the modeling step.
Activities: Data cleaning, transformation, feature selection, and creating the final dataset that will be
used for modeling.
4. Modeling:
Goal: Apply various modeling techniques to the prepared data to address the business problem.
Activities: Select modeling techniques, build models, tune model parameters, and validate model
performance.
5. Evaluation:
Goal: Assess the models to ensure they meet business objectives and provide meaningful results.
Activities: Evaluate the models based on performance metrics, review the steps taken, and verify if the
results are valid and useful.
Output: Evaluation report and decision on whether the model is ready for deployment.
6. Deployment:
Goal: Implement the results of the data mining process into the business.
Activities: Deploy the model or insights into production, create reporting systems, monitor performance,
and provide maintenance if needed.
Output: Final model or solution in use, possibly with documentation and reports for the end-users.
These steps are iterative, meaning that after completing one step, you may return to previous steps to
refine and improve your results. The flexibility of CRISP-DM makes it suitable for a variety of industries
and data-related projects.
• 3 Vs
• Volume: massive amounts of data that traditional tools struggle to store and process.
• Velocity: data arriving at high speed, requiring quick response times to derive insights and make decisions.
• Variety: data in many formats, from structured tables to unstructured text, images, and video.
Structured/Unstructured
Structured Data:
o Definition: Data that is highly organized and formatted in a consistent way, making it
easy to store, search, and analyze. It typically resides in databases or spreadsheets.
o Examples: Customer databases, sales records, sensor data, data in relational databases.
o Use: Structured data can be easily processed using SQL queries, statistical methods, or
machine learning algorithms.
o Key Characteristic: Data is organized in rows and columns with predefined formats (e.g.,
numbers, strings, dates).
Unstructured Data:
o Definition: Data with no predefined structure or consistent format, such as free text or media files, which makes it harder to store and analyze with traditional tools.
o Examples: Text documents, emails, social media posts, images, audio, and video files.
o Use: Requires specialized techniques like text mining, natural language processing (NLP),
or machine learning to extract insights.
o Key Characteristic: Lacks a predefined format, and the data is more complex to
interpret.
Nominal Variables:
o Definition: Variables that represent categories with no inherent order or ranking.
o Examples: Gender (male/female), types of fruits (apple, banana, orange), colors (red, blue, green).
Ordinal Variables:
o Definition: Variables that represent categories with a meaningful order or ranking, but
the intervals between values are not necessarily consistent or measurable.
o Examples: Education level (high school, bachelor’s, master’s), customer satisfaction (low,
medium, high), ranks in a race (1st, 2nd, 3rd).
o Key Characteristic: There is an inherent order, but differences between the categories
cannot be quantified.
Interval Variables:
o Definition: Variables that have meaningful, equal intervals between values, but no true zero point.
o Examples: Temperature in Celsius or Fahrenheit, calendar years.
o Key Characteristic: Differences between values are meaningful, but there is no absolute zero (e.g., 0°C does not mean the absence of temperature).
Ratio Variables:
o Definition: Variables that have meaningful intervals and a true zero point, which allows for both comparison and multiplication/division of values.
o Examples: Height, weight, income, number of purchases.
o Key Characteristic: True zero allows for meaningful comparisons such as “twice as much” or “half as much.”
***Key Differences:
Nominal vs. Ordinal: Nominal has no inherent order, while ordinal has a ranked order.
Interval vs. Ratio: Interval lacks a true zero point, while ratio variables have a true zero, making
ratios between values meaningful.
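A quick sketch of how nominal vs. ordinal variables might be represented in Python with pandas (the column names and category levels here are made up purely for illustration):
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange", "apple"],      # nominal: no order
    "satisfaction": ["low", "high", "medium", "high"],    # ordinal: ranked
})

# Nominal variable: categories with no inherent order
df["fruit"] = pd.Categorical(df["fruit"])

# Ordinal variable: categories with a meaningful order
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df["satisfaction"].min())   # "low" -- ordering is meaningful
print(df["fruit"].dtype)          # category (unordered)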
***Moving Average, Weighted Moving Average, and Exponential Smoothing
Moving Average:
Definition: A Moving Average Model is a method used to smooth out data by averaging a set number of past values. It helps reduce short-term fluctuations and shows the overall trend.
Example: If you're looking at stock prices over 5 days, you average the prices of the last 5
days to get a smoother value for the 6th day.
Weighted Moving Average:
Definition: A Weighted Moving Average is similar to a regular moving average, but more recent values are given more importance (or weight). Instead of treating all past values equally, it assigns bigger weights to more recent data points.
Example: In a 5-day stock price example, you give more importance to the latest prices
(like day 5 and 4) than to older ones (day 1, 2, 3).
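A minimal Python sketch of both averages (the price series and weights are made-up numbers):
import statistics

prices = [100, 102, 101, 105, 107]     # last 5 days of a hypothetical stock price

# Simple moving average: every day counts equally
simple_ma = statistics.mean(prices)    # (100+102+101+105+107)/5 = 103.0

# Weighted moving average: more recent days get bigger weights
weights = [0.1, 0.15, 0.2, 0.25, 0.3]  # weights sum to 1; the newest day weighted most
weighted_ma = sum(p * w for p, w in zip(prices, weights))

print(simple_ma, weighted_ma)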
Exponential Smoothing:
Definition: Exponential Smoothing is a forecasting method that gives more importance to recent data points while still considering older data. It does this with a smoothing constant α (a number between 0 and 1) that decides how much weight to put on recent data; α can also be read as the fraction of the forecast error that is carried into the next forecast.
Formula: New forecast = previous forecast + α × (actual demand − previous forecast).
Example: Suppose the previous forecast was 42 units, actual demand came to be 40, and α is 0.1. The new forecast is 42 + 0.1(40 − 42) = 41.8. If the next period's actual demand is 43, the following forecast is 41.8 + 0.1(43 − 41.8) = 41.92.
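The same worked example as a short Python sketch:
def exp_smooth(prev_forecast, actual, alpha=0.1):
    # New forecast = previous forecast + alpha * (actual - previous forecast)
    return prev_forecast + alpha * (actual - prev_forecast)

f1 = exp_smooth(42, 40)   # 42 + 0.1*(40 - 42) = 41.8
f2 = exp_smooth(f1, 43)   # 41.8 + 0.1*(43 - 41.8) = 41.92
print(f1, f2)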
1. Adjacency Matrix
Definition: An adjacency matrix is a 2D array (or matrix) used to represent a graph. Each
element in the matrix indicates whether there is an edge between a pair of nodes.
Properties:
o For an undirected network, the adjacency matrix is symmetric.
o How many edges are there in a directed network? Sum the entire matrix (not just half of it), since each entry is a distinct directed edge. In an undirected network each edge appears twice, so sum the matrix and divide by two.
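A small Python sketch of the edge-count rule, using a made-up 3-node graph (rows/columns assumed to be in order A, B, C):
# Adjacency matrix for an undirected triangle A-B, A-C, B-C
undirected = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
total = sum(sum(row) for row in undirected)
print(total // 2)   # undirected: each edge counted twice -> 3 edges

# For a directed network, every 1 is its own edge, so sum the entire matrix
directed = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
print(sum(sum(row) for row in directed))   # 3 directed edges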
2. Edge List
Definition: An edge list is a collection of all the edges in the graph, where each edge is
represented as a pair (or tuple) of nodes. If the graph is weighted, the list may include
the weights of the edges as well.
Properties:
o Each row in the list shows two nodes that are connected and optionally the
weight.
o This representation is more space-efficient for sparse graphs (graphs with fewer
edges).
Example (for the same triangle graph A-B, A-C, B-C):
(A, B)
(A, C)
(B, C)
3. Adjacency List
Definition: An adjacency list stores, for each node, the list of nodes it is directly connected to.
Properties:
o Suitable for sparse graphs because it only stores the neighbors of each node.
o For directed graphs, the adjacency list indicates only outgoing edges.
Example:
A -> [B, C]
B -> [A, C]
C -> [A, B]
Comparison:
Adjacency Matrix: Best for dense graphs (graphs with lots of edges). It allows quick
lookups for whether two nodes are connected, but it uses O(n^2) space, even for sparse
graphs.
Edge List: Space-efficient for sparse graphs. Finding if two nodes are connected may
require checking the entire list.
Adjacency List: Combines the advantages of both adjacency matrices and edge lists, and
is typically more space-efficient for sparse graphs while allowing quick access to
neighbors.
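A Python sketch showing the same small triangle graph (A-B, A-C, B-C) as an edge list, an adjacency matrix, and an adjacency list:
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]   # edge list

# Adjacency matrix (rows/columns follow the order of `nodes`)
idx = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[idx[u]][idx[v]] = 1
    matrix[idx[v]][idx[u]] = 1   # undirected: mirror the entry

# Adjacency list built from the edge list
adj = {n: [] for n in nodes}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(matrix)   # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(adj)      # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}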
Directed/Undirected Network: in a directed network each edge has a direction (A → B is not the same as B → A); in an undirected network edges have no direction.
Small World Phenomenon: a network exhibits the small-world phenomenon when its nodes have high clustering coefficients and, at the same time, any node can be reached from any other in only a few steps.
Power law / Scale-free network: if the degree distribution follows a power law (also called a long-tail distribution), the network is a scale-free network: a few hub nodes have very many connections while most nodes have few.
1. Random Network:
A network where connections between nodes are made randomly, and the degree
distribution of nodes follows a Poisson distribution, meaning most nodes have a similar
number of connections.
2. Small-World Network:
A network characteristic where most nodes can be reached from any other node in just
a few steps, despite the network being large. It combines short path lengths with high
clustering of nodes.
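A sketch comparing the two models with networkx (the sizes and probabilities are arbitrary choices for illustration); the small-world (Watts-Strogatz) graph should show much higher average clustering than the random (Erdős-Rényi) graph at a similar average path length:
import networkx as nx

random_g = nx.erdos_renyi_graph(n=200, p=0.05, seed=42)              # random: Poisson degree distribution
small_world = nx.watts_strogatz_graph(n=200, k=10, p=0.05, seed=42)  # small-world model

for name, g in [("random", random_g), ("small-world", small_world)]:
    clustering = nx.average_clustering(g)
    # average path length is only defined if the graph is connected
    path_len = nx.average_shortest_path_length(g) if nx.is_connected(g) else float("nan")
    print(name, round(clustering, 3), round(path_len, 2))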
Clustering coefficient: how clustered a node's neighborhood is, i.e., how well the node's neighbors are also directly connected to each other.
For a node with n neighbors, the maximum number of connections among those neighbors is n(n-1)/2.
Clustering coefficient = (actual connections among the neighbors) / (n(n-1)/2).
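A minimal Python sketch of the local clustering coefficient computed straight from this definition, on a small made-up graph:
from itertools import combinations

# Hypothetical undirected graph as an adjacency list
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def clustering(node):
    neighbors = adj[node]
    n = len(neighbors)
    if n < 2:
        return 0.0
    max_links = n * (n - 1) / 2   # maximum possible neighbor-neighbor links
    actual = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return actual / max_links

print(clustering("A"))   # neighbors B, C, D; only B-C exists -> 1/3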
• Network Properties/Centralities
• Degree centrality (node-level property) and average degree: degree centrality counts how many edges a node has; in a directed network this splits into in-degree (incoming edges) and out-degree (outgoing edges).
Average Degree:
Definition: Average Degree refers to the average number of connections (edges) each node has in a network. It gives an idea of how connected the network is overall. For an undirected network, average degree = 2E/N, where E is the number of edges and N the number of nodes.
Example: If a network has 10 nodes and 30 edges, the average degree is (2 × 30) / 10 = 6, meaning each node has an average of 6 connections.
• Degree Distribution (network-level property)
Describes how the degrees are spread across the network, i.e., what fraction of nodes has each degree.
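A short Python sketch computing average degree and degree distribution from an edge list (a tiny made-up undirected graph; the 10-node/30-edge example would work the same way):
from collections import Counter

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]   # small undirected example

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n_nodes = len(degree)
avg_degree = 2 * len(edges) / n_nodes   # undirected: average degree = 2E / N
print(avg_degree)                       # 2*4/4 = 2.0

# Degree distribution: how many nodes have each degree
distribution = Counter(degree.values())
print(distribution)                     # Counter({2: 2, 3: 1, 1: 1})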
• Closeness***: how close a node is to all other nodes, based on the shortest-path distances from that node to every other node (the closer on average, the higher the closeness centrality).
• Betweenness***: how many times a node lies on the shortest paths between all other pairs of nodes.
• Cliques***: a subset of the network in which every node is directly connected to every other node.
• Connected Components***: maximal groups of nodes in which every node can reach every other node through some path.
• Structural holes***: gaps between groups of nodes that are not directly connected; a node bridging such a gap can broker information between the groups.
• Bipartite networks***: nodes of the same kind are never connected to each other; they are connected only through nodes of the other kind (e.g., people linked through the events they attend).
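A sketch computing several of these properties with networkx on a tiny made-up graph:
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.degree_centrality(G))           # degree-based centrality per node
print(nx.closeness_centrality(G))        # closeness: based on shortest-path distances
print(nx.betweenness_centrality(G))      # betweenness: shortest paths passing through each node
print(list(nx.find_cliques(G)))          # maximal cliques, e.g. the A-B-C triangle
print(list(nx.connected_components(G)))  # connected components (one here)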