
Business Intelligence Overview (SIT 303)

Context
Decision-making at all organizational levels impacts performance.
Decisions can be strategic or tactical; collectively, they define daily operations.
Better decision-making leads to enhanced efficiency, profitability, and customer satisfaction,
which is the core function of Business Intelligence (BI).
History of BI
Early computers were built for specific calculations; as their use expanded, the need to store the
resulting information drove the development of data storage.
Post-1940s, mass storage and Database Management Systems (DBMS) evolved to handle
growing data.
Businesses recognized the analytical value of data, leading to the creation of data warehouses
for unified data views.
Transactional systems, such as Point-of-Sale (POS), were developed for daily operations, followed
by decision support systems and modern BI platforms such as Microsoft Power BI and Tableau.
Definition of Business Intelligence
BI refers to software that processes business data into user-friendly formats (reports,
dashboards).
It encompasses various data types and helps organizations gain insights for decision-making.
BI enables companies to examine trends and make informed predictions.
Key Functions of BI
Data Conversion: Raw data → Information → Insights → Decisions.
BI Big Four:
Accurate Answers: Must reflect organizational reality.
Valuable Insights: Impactful information for organizational success.
Timely Information: Current data for effective decision-making.
Actionable Conclusions: Insights should lead to concrete actions.
BI Value Proposition
BI empowers organizations to ask critical questions and make data-driven decisions.
It provides insights into past and current performance, aiding strategic planning and resource
allocation.
BI enhances customer understanding, operational monitoring, and supply chain management.
BI Lifecycle
Planning: Define goals, identify data sources, select tools.
Data Acquisition: Collect, clean, and integrate data.
Data Modeling: Create structures to manage BI data.
Data Analysis: Identify trends and share insights.
Reporting and Visualization: Present data clearly for decision-making.
Deployment and Maintenance: Implement solutions and ensure data accuracy.
Big Data and BI Nexus
Big Data Characteristics: Volume, Velocity, Variety, Veracity, Value.
Big Data provides the raw material; BI processes convert this data into actionable insights.
BI tools help identify trends, predict future events, and improve customer service.
Types of Analytics
Descriptive: What happened?
Predictive: What might happen?
Prescriptive: What should be done?
Diagnostic: Why did it happen?

Summary of Data Warehousing


Introduction
Definition: A data warehouse is a separate data repository from operational databases,
designed to integrate and analyze historical data to support decision-making.
Key Features: Defined by William H. Inmon as subject-oriented, integrated, time-variant, and
non-volatile.

Data Lake
Overview: A centralized repository for storing large volumes of raw data (structured, semi-
structured, and unstructured) without predefined schemas.
Flexibility: Data lakes can ingest data from various sources in real-time or batch mode and
support multiple languages for analysis.

Data Lake vs. Data Warehouse


Data Structure: Data lakes store raw data; data warehouses store processed, structured data.
Processing: Data lakes use ELT (Extract, Load, Transform); data warehouses use ETL (Extract,
Transform, Load).
Cost: Data lakes generally have lower storage costs but higher processing costs; data
warehouses are costlier due to structured storage.
Performance: Data warehouses are optimized for fast querying; data lakes require data
processing before analysis.

Major Features of a Data Warehouse


Subject-oriented: Focused on key subjects for decision-making, providing a clear view for
analysis.
Integrated: Combines data from various sources ensuring consistency.
Time-variant: Maintains historical data for analysis.
Non-volatile: Data is loaded and read for analysis but not updated in place; the warehouse does not process day-to-day transactional updates.
Other Features and Advantages
Separation from Operational Databases: Enhances performance for both systems.
Integration of Heterogeneous Data: Offers a unified view for reporting and analysis.
Improved Decision Making: Reduces resource use on operational systems and provides a
common data model.
Enhanced Security and Customer Service: Safeguards data and improves relationships through
better insights.
The Need for a Separate Data Warehouse
Operational databases focus on daily transactions, while data warehouses cater to complex
analytical queries. Separating these systems maintains performance and ensures the integrity
of decision-making processes.

Data Warehouse Models


Enterprise Warehouse: Collects comprehensive organizational data.
Data Mart: A subset of data for specific user groups.
Virtual Warehouse: Provides a logical view over operational databases for easier access.
Differences Between Operational Database Systems and Data Warehouses
Functionality: Operational systems (OLTP) support daily operations; data warehouses (OLAP)
focus on long-term analysis.
Data Structure: OLTP systems are normalized for efficient updates; data warehouses are typically
denormalized for fast analytical queries.
User Base: Operational users include clerks and DBAs; data warehouse users are typically
managers and analysts.

Summary of Data Integration and Extraction


Introduction
Data Integration: The process of combining and transforming data from multiple sources into a
unified format for improved decision-making and insights. It ensures data is accurate,
consistent, and accessible across an organization.

Key Issues in Data Integration


Data Quality and Consistency: Ensuring data accuracy and completeness through profiling,
cleansing, and validation.
Data Mapping and Transformation: Establishing clear rules to convert diverse data formats into
a common schema.
Data Security and Privacy: Protecting sensitive data with access controls, encryption, and
compliance with regulations.
Performance and Scalability: Optimizing integration workflows to handle large data volumes
efficiently.
Data Governance and Compliance: Aligning integration with governance frameworks to ensure
accountability and integrity.
Real-time Integration: Implementing mechanisms like change data capture (CDC) for timely
data updates.
Data Ownership and Collaboration: Establishing clear ownership and fostering cooperation
across teams.
Data Latency: Reducing delays through defined refresh schedules and synchronization methods.
Vendor Lock-In: Avoiding reliance on proprietary tools by opting for open standards.
Data Volume and Variety: Using flexible approaches to integrate diverse data types.

Extract, Transform, Load (ETL)


ETL Process: A method for integrating data from multiple sources into a single, consistent data
store, typically for data warehousing. It includes:
Extract: Copying raw data from sources to a staging area (e.g., SQL servers, CRM systems).
Transform: Processing data in the staging area through cleansing, validation, and formatting.
Load: Moving the transformed data into a target system, usually automated and performed
during low-traffic periods.
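
As a minimal sketch of this ETL flow in Python with pandas: the inline CSV text, column names, and cleaning rules below are illustrative assumptions, and the SQLite database simply stands in for the target data store.

    import sqlite3
    from io import StringIO

    import pandas as pd

    # Extract: copy raw data from source systems into a staging area. The inline CSV
    # text stands in for exports from a transactional database and a CRM system.
    orders_raw = StringIO(
        "order_id,customer_id,order_date,amount\n"
        "1,10,2024-01-05,120\n"
        "2,10,bad-date,80\n"
        "3,11,2024-02-11,-15\n")
    customers_raw = StringIO("customer_id,segment\n10,retail\n11,wholesale\n")
    orders = pd.read_csv(orders_raw)
    customers = pd.read_csv(customers_raw)

    # Transform: cleanse, validate, and format in the staging area.
    orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
    orders = orders.dropna(subset=["order_date"])             # drop rows with invalid dates
    orders["amount"] = orders["amount"].clip(lower=0)         # simple validation rule
    staged = orders.merge(customers, on="customer_id", how="left")  # integrate the sources

    # Load: write the transformed data into the target store (a local SQLite file here).
    with sqlite3.connect("warehouse.db") as conn:
        staged.to_sql("fact_orders", conn, if_exists="replace", index=False)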

ETL vs. ELT


Order of Operations: ELT loads raw data directly into the target system for later transformation,
while ETL transforms data before loading.
Use Cases: ELT is advantageous for high-volume unstructured data, while ETL requires more
upfront planning and definition.
Other Data Integration Methods
Change Data Capture (CDC): Captures only changed data for efficient integration (see the sketch after this list).
Data Replication: Copies data in real-time or batches for backups and disaster recovery.
Data Virtualization: Creates a unified view of data without physical copies, enabling virtual data
warehouses and lakes.
Stream Data Integration (SDI): Continuously integrates real-time data streams for immediate
analysis.
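
A toy sketch of the change data capture idea referenced above: it diffs two table snapshots in pandas to find inserted, deleted, and updated rows. Production CDC tools usually read database transaction logs instead; the key and column names here are assumptions.

    import pandas as pd

    def capture_changes(previous: pd.DataFrame, current: pd.DataFrame, key: str) -> dict:
        """Diff two snapshots keyed by `key` into inserted, deleted, and updated rows."""
        prev, curr = previous.set_index(key), current.set_index(key)
        inserted = curr.loc[curr.index.difference(prev.index)]
        deleted = prev.loc[prev.index.difference(curr.index)]
        common = curr.index.intersection(prev.index)
        changed = (curr.loc[common] != prev.loc[common]).any(axis=1)
        return {"inserted": inserted, "deleted": deleted, "updated": curr.loc[common][changed]}

    old = pd.DataFrame({"id": [1, 2], "qty": [5, 3]})
    new = pd.DataFrame({"id": [2, 3], "qty": [4, 7]})
    print(capture_changes(old, new, key="id"))   # id 3 inserted, id 1 deleted, id 2 updated
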
Summary of Data Warehouse Modeling
Data Warehouse Modeling: Data Cube
Entity-Relationship Model: Common in relational databases, focusing on entities and
relationships, suitable for online transaction processing (OLTP).
Multidimensional Model: For data warehouses, a concise schema is needed to support Online
Analytical Processing (OLAP), visualized as a data cube.

Data Cube Structure


Dimensions and Facts:

Dimensions: Perspectives or entities for recording data (e.g., time, item, location). Each has an
associated dimension table with descriptive attributes.
Facts: Numeric measures (e.g., dollars sold, units sold) stored in a central fact table, linked to
dimension tables via keys.
Cuboids: Represent different levels of data summarization, forming a lattice of cuboids that
constitute a data cube. The base cuboid shows the lowest level of detail, while the apex cuboid
presents the highest level of summarization.

Conceptual Modeling of Data Warehouse


Multidimensional Models: Common models include star schema, snowflake schema, and fact
constellation schema.
Star Schema:

Features a central fact table with associated dimension tables arranged in a radial pattern.
Simple design that is easy to query and performs well, although its denormalized dimension
tables can carry some redundancy (see the sketch after this schema list).
Snowflake Schema:

A variant of the star schema with normalized dimension tables, which reduces redundancy but
complicates queries due to additional joins.
Useful for maintaining data but can affect browsing efficiency.
Fact Constellation Schema:

Allows multiple fact tables to share dimension tables, resembling a galaxy or collection of stars.
Useful for complex applications requiring detailed analytics across different measures (e.g.,
sales and shipping).
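
To make the star schema concrete, a small pandas sketch: a central sales fact table holds measures and foreign keys, the dimension tables hold descriptive attributes, and a join plus group-by answers an analytical query. Table and column names are illustrative assumptions.

    import pandas as pd

    # Dimension tables: descriptive attributes keyed by surrogate keys.
    dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
    dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["pen", "book"], "brand": ["A", "B"]})

    # Central fact table: numeric measures plus foreign keys into each dimension.
    fact_sales = pd.DataFrame({
        "time_key": [1, 1, 2, 2],
        "item_key": [10, 11, 10, 11],
        "dollars_sold": [120.0, 300.0, 150.0, 280.0],
        "units_sold": [12, 10, 15, 9],
    })

    # Query the star schema: join the fact table to its dimensions, then summarize.
    joined = fact_sales.merge(dim_time, on="time_key").merge(dim_item, on="item_key")
    print(joined.groupby(["quarter", "item_name"])[["dollars_sold", "units_sold"]].sum())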

Summary of Data Warehouse and OLAP Technology


Introduction
Data Warehousing and OLAP: Essential components for decision support, now widely adopted
in the database industry, with major vendors offering related products.
Decision Support vs. OLTP: Data warehousing and OLAP differ significantly from traditional
online transaction processing (OLTP), focusing more on analytical queries.

Meaning of OLAP
OLAP Definition: On-Line Analytical Processing (OLAP) enables users to generate descriptive and
comparative summaries of multidimensional data. The term was coined by E. F. Codd in 1993, and
OLAP allows interactive data analysis.
Contrast with OLTP: OLAP systems are designed for analysis rather than transaction processing;
such analysis-oriented systems were historically referred to as Decision Support Systems (DSS).

OLAP and Data Warehouse


Differences: A data warehouse manages data, primarily using relational technology, while OLAP
provides a multidimensional view of this data for quick strategic insights.
Complementarity: Data warehouses store data, and OLAP transforms it into useful information
for analysis, enabling complex calculations and interactive access to various data views.

Benefits of OLAP
Consistent Reporting: OLAP ensures steady calculations and coherent reporting, allowing
managers to analyze data both broadly and in detail.
User Independence: Empowers business users to build models independently while maintaining
data integrity.
Efficiency: Reduces query load on transaction systems, leading to more efficient operations and
faster market responsiveness, which can enhance revenue and profitability.

OLAP Functionalities
Roll-up: Aggregates data by climbing up a hierarchy.
Drill-down: Provides more detailed data by stepping down a hierarchy or adding dimensions.
Slice and Dice: Allows users to view a sub-cube by selecting specific dimensions.
Pivot: Rotates data axes for different perspectives.
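
A brief sketch of these operations on a small in-memory dataset using pandas; the dimensions (time, item, location) echo the earlier data-cube example, and the figures are invented.

    import pandas as pd

    sales = pd.DataFrame({
        "year": [2023, 2023, 2023, 2024, 2024, 2024],
        "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
        "item": ["pen", "book", "pen", "book", "pen", "book"],
        "location": ["Lagos", "Accra", "Lagos", "Accra", "Lagos", "Accra"],
        "dollars_sold": [100, 250, 130, 240, 160, 270],
    })

    # Roll-up: climb the time hierarchy from quarter to year.
    rollup = sales.groupby(["year", "item"])["dollars_sold"].sum()

    # Drill-down: step back down by adding the quarter level.
    drilldown = sales.groupby(["year", "quarter", "item"])["dollars_sold"].sum()

    # Slice: fix one dimension (location = "Lagos") to obtain a sub-cube.
    lagos_slice = sales[sales["location"] == "Lagos"]

    # Dice: select on two or more dimensions at once.
    dice = sales[(sales["item"] == "pen") & (sales["quarter"] == "Q2")]

    # Pivot: rotate the axes to view items as columns against years as rows.
    print(sales.pivot_table(index="year", columns="item", values="dollars_sold", aggfunc="sum"))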

OLAP Server
Definition: A high-capacity, multi-user data manipulation engine designed for multidimensional
data structures, enabling efficient data access.

Types of OLAP
Relational OLAP (ROLAP): Uses relational databases, providing aggregation navigation and tools
while maintaining data in relational tables.
Multidimensional OLAP (MOLAP): Stores data in multidimensional cubes for fast query
performance, with data pre-summarized.
Hybrid OLAP (HOLAP): Combines features of ROLAP and MOLAP, allowing data to be stored in
both relational databases and multidimensional formats.
Web OLAP (WOLAP): Accessible via web browsers, offering lower investment and easier
deployment but with limited functionality compared to traditional OLAP.
Desktop OLAP (DOLAP): Users can download and work with data locally, offering ease of use at
a lower cost but with limited features.
Mobile OLAP: Provides OLAP functionality on mobile devices, enabling remote data access.
Spatial OLAP (SOLAP): Integrates GIS capabilities with OLAP, managing both spatial and non-
spatial data for enhanced exploration.
Summary of Data Mining
Introduction
Definition: Data mining involves discovering patterns and insights from large datasets using
various techniques, crucial for decision-making, prediction, and optimization.
Applications: Common uses include customer profiling, targeting competitive customers, and
market-basket analysis for cross-selling strategies.
Exploratory Data Analysis: Data mining is often described as exploratory data analysis; it combines
classical statistical methods with automated AI techniques.

Requirements for Data Mining


Problem Identification: A clear problem definition and relevant data collection are essential.
Tools and Techniques: Versatile and scalable data mining tools are necessary, allowing for
effective analysis and predictions.
Analyst Judgment: Human input is critical for selecting data and understanding statistical
concepts, which guide the mining process.

Data Mining Process


General Process: Data mining typically follows a structured process, with CRISP-DM and SEMMA
being two prominent methodologies.

Cross-Industry Standard Process for Data Mining (CRISP-DM)


Business Understanding: Define objectives, assess current situations, and develop a project
plan.
Data Understanding: Collect and explore data, verify quality, and identify initial patterns.
Data Preparation: Clean, select, and format data for modeling.
Modeling: Utilize tools for analysis and develop predictive models using various techniques.
Evaluation: Assess model results against business objectives, refining understanding through
iteration.
Deployment: Implement models for practical applications and monitor them for ongoing
relevance.

SEMMA (Sample, Explore, Modify, Model, Assess)


Sample: Extract a representative sample of the data to ensure efficient processing.
Explore: Search for trends and anomalies to refine the understanding of the dataset.
Modify: Create and transform variables based on exploratory findings, focusing on significant
predictors.
Model: Construct and test various models to predict outcomes based on the prepared data.
Assess: Evaluate model performance using reserved data samples to ensure reliability and
validity.

Comparison of CRISP-DM and SEMMA


Process Phases: CRISP-DM has six phases emphasizing iterative processes and business
alignment, while SEMMA has five phases focused on modeling.
Focus: CRISP-DM emphasizes understanding business problems; SEMMA prioritizes modeling
techniques.
Data Preparation: In CRISP-DM, data preparation is a distinct phase; in SEMMA, it's integrated
into the Modify phase.
Deployment: Both processes involve model deployment, but CRISP-DM stresses ongoing
monitoring more than SEMMA.

Summary of Data Mining Tasks


Introduction
Data mining is the process of uncovering patterns and insights from large datasets using various
algorithms and statistical models. Key tasks in data mining include classification, clustering,
association rule mining, regression, time series analysis, prediction, summarization, and
sequence discovery.

1. Classification
Definition: Assigns data points to predefined categories based on labeled training data.
Examples: Email spam detection, credit risk assessment, and disease diagnosis.
Scenario: Predicting fraudulent transactions using algorithms like decision trees and neural
networks.
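
A hedged sketch of this scenario with scikit-learn: a decision tree is trained on a synthetic, imbalanced dataset standing in for labeled transactions (the features and class balance are assumptions, not real fraud data).

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic "transactions": features stand in for amount, frequency, etc.;
    # roughly 5% of the labels mark the positive (fraudulent) class.
    X, y = make_classification(n_samples=1000, n_features=6, weights=[0.95], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    model.fit(X_train, y_train)                             # learn from labeled training data
    print(accuracy_score(y_test, model.predict(X_test)))    # evaluate on held-out data
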
2. Clustering
Definition: Groups similar data points into clusters without predefined categories.
Examples: Segmenting customers based on demographics and buying behavior.
Scenario: Using k-means or hierarchical clustering to find natural groupings among customers.
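
A small k-means sketch with scikit-learn; the two customer features (annual spend and visits per month) and the number of clusters are assumptions chosen for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Synthetic customers: columns are annual spend and visits per month.
    rng = np.random.default_rng(0)
    customers = np.vstack([
        rng.normal([500, 2], [100, 1], size=(50, 2)),     # low-spend, infrequent visitors
        rng.normal([3000, 10], [400, 2], size=(50, 2)),   # high-spend, frequent visitors
    ])

    features = StandardScaler().fit_transform(customers)  # scale before distance-based clustering
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    print(kmeans.labels_[:10])                            # cluster assignment per customer
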
3. Regression
Definition: Maps data items to real-valued outcomes, identifying the best-fitting function (e.g.,
linear).
Example: A professor predicts retirement savings using linear regression based on past values.
4. Time Series Analysis
Definition: Examines how attribute values change over time, typically at regular intervals.
Functions: Determine similarity between time series, classify their behavior, and predict future
values.
Scenario: Charting stock prices to decide on investments based on volatility and growth.
5. Prediction
Definition: Forecasts future data states based on historical data, often using classification
methods.
Examples: Flood prediction using river monitoring data.
Scenario: Predicting water levels from upstream sensors.
6. Summarization
Definition: Maps data into subsets with simple descriptions to provide insights.
Example: A financial statement summarizing a company's performance, including revenue and
expenses.
7. Association Rules
Definition: Uncovers relationships among data, often used in retail to identify items frequently
purchased together.
Example: A grocery store analyzes purchase data to determine that bread is often bought with
pretzels and jelly, guiding marketing strategies.
8. Sequence Discovery
Definition: Identifies patterns based on the sequential order of events over time.
Example: Analyzing web traffic to find common navigation paths among users, leading to
changes in website structure.

Summary of Association Rule Mining


Overview
Association rule mining identifies relationships between items in transactional data, commonly
used in retail for marketing, inventory control, and product placement. These rules indicate
patterns of co-occurrence among products but do not imply causation.
Key Concepts
Association Rules: Typically expressed as X → Y, where X is a set of items and Y is a single item.
They reveal common item combinations in transactions.
Measures:
Support: The fraction of transactions in which an itemset appears. E.g., if 200 of 1,000
transactions contain both bread and peanut butter, the support of {bread, peanut butter} is 20%.
Confidence: The likelihood of buying Y given that X is purchased. E.g., a confidence of 80% for X → Y
indicates that 80% of transactions containing X also contain Y.
Minimum Support and Confidence: User-defined thresholds filter out infrequent or weak
associations.

Algorithms
Apriori Algorithm:
Generates frequent itemsets whose support meets a user-defined threshold; confidence is then used to form rules.
Scans the dataset to identify frequent single items and builds larger itemsets by combining
them.
Prunes infrequent candidates based on the "Apriori property," which states that all supersets of
an infrequent itemset are also infrequent.
Generates association rules from discovered frequent itemsets.
FP-Growth Algorithm:
An efficient alternative to Apriori that constructs a compact FP-tree, reducing database scans.
Builds the FP-tree from the transaction database and mines frequent itemsets directly from the
tree.
Avoids candidate generation, making it faster and more suitable for large datasets.
Implementation Example
FP-Growth Steps:
Build the FP-tree based on item frequencies.
Mine frequent itemsets from the FP-tree.
Generate association rules by evaluating confidence and pruning those that do not meet the
threshold.
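
A brute-force sketch of support and confidence computation over the toy grocery transactions from earlier; it enumerates frequent itemsets directly rather than building an FP-tree, so it only illustrates the measures and the rule-generation step, not the FP-Growth data structure itself.

    from itertools import combinations

    transactions = [
        {"bread", "jelly", "pretzels"},
        {"bread", "jelly"},
        {"bread", "pretzels"},
        {"milk", "pretzels"},
    ]
    min_support, min_confidence = 0.5, 0.6
    n = len(transactions)

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / n

    # Enumerate frequent itemsets by brute force (Apriori/FP-Growth prune this search).
    items = set().union(*transactions)
    frequent = [frozenset(c)
                for size in range(1, len(items) + 1)
                for c in combinations(items, size)
                if support(frozenset(c)) >= min_support]

    # Form rules X -> Y from each frequent itemset and keep those meeting the confidence threshold.
    for itemset in (f for f in frequent if len(f) > 1):
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = support(itemset) / support(antecedent)
            if confidence >= min_confidence:
                print(f"{set(antecedent)} -> {consequent}: "
                      f"support={support(itemset):.2f}, confidence={confidence:.2f}")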

Challenges and Issues


Scalability: Difficulty in processing large datasets and high-dimensional data.
Efficiency: Traditional algorithms can be time-consuming and memory-intensive.
Parameter Selection: Choosing optimal support and confidence thresholds is crucial for relevant
rule generation.
Rule Quality: Determining the interestingness of generated rules remains subjective.
High-Dimensional Data: Handling numerous attributes can complicate rule discovery.
Sparse Data: Infrequent item combinations can yield weak associations.
Continuous and Heterogeneous Data: Traditional methods struggle with non-binary data types.
Domain Knowledge: Incorporating expert knowledge can enhance meaningful rule discovery.
Interpretability: Presenting large rule sets understandably is challenging.
Real-time Data: Adapting algorithms for streaming data requires ongoing research.
Summary: Time Series Mining
Introduction

Time series data consists of observations collected over discrete time intervals, recorded
chronologically.
It helps analyze trends, patterns, and seasonality, aiding in forecasting and decision-making
across various applications.

Key Terms

Time series data: Observations linked to specific timestamps.
Time steps: Discrete intervals in the time series.
Trend: Long-term direction in the data.
Seasonality: Recurring patterns at fixed intervals (e.g., daily, weekly).
Autocorrelation: Correlation of a time series with its past values.
Stationarity: Consistency of statistical properties over time.
Forecasting: Predicting future values based on past data.
Resampling: Adjusting time intervals of the data.
Smoothing: Techniques to reduce noise in the data.

Applications of Time Series

Forecasting: Predicting future values (e.g., stock prices, sales).
Anomaly Detection: Identifying unusual patterns (e.g., fraud).
Classification: Using features for event classification (e.g., activity recognition).
Pattern Recognition: Discovering hidden patterns (e.g., financial data).
Regression: Modeling relationships between time series and other variables.
Clustering: Grouping similar time series data.

Decomposition in Time Series Data


Decomposition breaks a time series into three components:
Trend: Overall movement in the data (can be linear or nonlinear).
Seasonality: Regular patterns that repeat (additive or multiplicative).
Residual: Random fluctuations not explained by trend or seasonality.

Methods for Time Series Decomposition


Moving Averages: Estimates the trend by averaging observations over a fixed-length window.
Classical Decomposition: Uses statistical techniques to separate components.
STL (Seasonal and Trend decomposition using LOESS): A robust method for decomposition.
X-12-ARIMA: A seasonal adjustment program combining ARIMA modeling with filters.
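
A short sketch, assuming the statsmodels library is available, that applies classical additive decomposition to a fabricated monthly series; a centered rolling mean is also shown as the moving-average trend estimate.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Fabricated monthly series: upward trend + yearly seasonality + noise.
    idx = pd.date_range("2018-01-01", periods=60, freq="MS")
    values = (np.linspace(100, 160, 60)                        # trend component
              + 10 * np.sin(2 * np.pi * np.arange(60) / 12)    # seasonal component, period 12
              + np.random.default_rng(0).normal(0, 2, 60))     # residual noise
    series = pd.Series(values, index=idx)

    trend_estimate = series.rolling(window=12, center=True).mean()    # moving-average trend

    result = seasonal_decompose(series, model="additive", period=12)  # classical decomposition
    print(result.trend.dropna().head())
    print(result.seasonal.head(12))    # the repeating seasonal pattern
    print(result.resid.dropna().head())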

Time Series Forecasting


Involves predicting future values based on historical data through several steps:
Data Preparation: Cleaning and transforming data.
Exploratory Data Analysis (EDA): Visual and statistical analysis of historical data.
Model Selection: Choosing an appropriate forecasting model.
Model Training and Validation: Splitting data for training and evaluating model performance.
Forecasting and Evaluation: Making predictions and comparing them with actual values.
Model Refinement and Iteration: Adjusting parameters and models based on evaluation results.
Forecast Visualization and Communication: Presenting results effectively.

Examples of Forecasting Models


ARIMA: Captures linear relationships in stationary data.
SARIMA: Extends ARIMA to include seasonality.
Prophet: Handles complex patterns, including seasonal and holiday effects.
Machine Learning Models: Such as random forests and neural networks, can also be utilized for
capturing complex patterns.
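
A minimal forecasting sketch with statsmodels ARIMA following the steps above (prepare data, split for validation, fit, forecast, evaluate); the series is fabricated and the order (1, 1, 1) is an illustrative choice, not a tuned model.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Fabricated monthly demand series with a gentle upward trend.
    idx = pd.date_range("2019-01-01", periods=48, freq="MS")
    series = pd.Series(
        120 + 0.8 * np.arange(48) + np.random.default_rng(1).normal(0, 3, 48), index=idx)

    # Train/validation split: hold out the last six months for evaluation.
    train, test = series[:-6], series[-6:]

    model = ARIMA(train, order=(1, 1, 1)).fit()    # illustrative order, not a tuned choice
    forecast = model.forecast(steps=6)

    mae = np.abs(forecast.values - test.values).mean()
    print(forecast)
    print(f"MAE on held-out months: {mae:.2f}")
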
Summary: Sequence Mining
Introduction
Sequence mining, or sequential pattern mining, is a technique in data mining aimed at
discovering patterns or sequences of events in sequential data.
It focuses on identifying frequently occurring subsequences and extracting insights from data.
This technique is applicable in various fields, including market basket analysis, web
clickstreams, DNA analysis, healthcare, and customer behavior analysis.
Examples of Sequence Mining Applications

Market Basket Analysis:


This involves identifying commonly purchased item combinations in retail.
Retailers analyze transaction data to discover product associations for cross-selling, inventory
management, and targeted marketing.
For example, customers buying diapers may also frequently purchase baby wipes.

Web Clickstreams:
Sequence mining analyzes user navigation patterns on websites by examining the order of web
pages visited.
Insights can reveal user preferences and navigation bottlenecks.
For instance, it may be found that users follow a specific page sequence before making a
purchase.

Customer Behavior Analysis:


This application looks at customer interactions, such as browsing and purchase history.
By analyzing the sequence of actions, businesses can identify behavior patterns and predictors
of churn.
For example, patterns in mobile app interactions may help identify users likely to leave,
enabling proactive retention strategies.

Healthcare:
Sequence mining aids in analyzing patient treatment patterns to identify effective strategies.
By examining sequences of medical procedures, medications, and diagnoses, healthcare
providers can identify common paths leading to successful outcomes or adverse events.
This information helps in creating personalized treatment plans and improving patient care.

Summary of Social Network Mining


Introduction
Social media platforms are key interaction tools today, with Facebook hosting around 2.958
billion users and TikTok being the fastest-growing.
Businesses targeting millennials and Gen Z must utilize social media, as users typically engage
with at least seven platforms, generating significant big data daily.

What is Social Media Mining?


Social media mining, or analytics, involves collecting and analyzing data from social media to
extract insights, identify trends, and inform decisions.
It uses publicly available data (demographics, user-generated content) to gauge attitudes and
behaviors towards topics or products.

Key Aspects and Techniques


Data Collection: Involves using APIs for structured data retrieval, as well as crawling and
scraping web content. Real-time data streams allow for immediate insights but pose challenges
like noise and integration complexities.
Data Normalization: Ensures data consistency across platforms through standardization of
textual data, timestamps, and quantitative metrics.
Text Analytics: Applies natural language processing (NLP) to extract insights from text. It
includes sentiment analysis to determine emotional tones in social media content.

Benefits of Text Analytics


Increases efficiency, improves decision-making, enhances customer service, and boosts sales
through better understanding of customer needs and behaviors.

Sentiment Analysis
A critical part of text analytics, sentiment analysis classifies text as positive, negative, or neutral,
using machine learning and rule-based methods.
Challenges include ambiguity in language, subjectivity of opinions, and potential biases in
models.
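
A hedged sketch of lexicon-based sentiment classification, assuming NLTK and its VADER lexicon are installed; the sample posts are invented, and the compound-score cut-offs follow common practice rather than anything in these notes.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)    # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()

    posts = [
        "Absolutely love the new update, great job!",
        "The app keeps crashing and support never replies.",
        "Release notes are posted on the website.",
    ]

    for post in posts:
        compound = analyzer.polarity_scores(post)["compound"]
        # Common convention: > 0.05 positive, < -0.05 negative, otherwise neutral.
        label = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
        print(f"{label:8s} {compound:+.2f}  {post}")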

Summary of Topic Modeling and Related Techniques


Topic Modeling
Definition: A method used to uncover underlying themes in social media text by statistically
modeling word co-occurrences to create topics.
Techniques: Common methods include Latent Dirichlet Allocation (LDA) and Non-negative
Matrix Factorization (NMF).

Applications:
Document Clustering: Groups related documents.
Text Summarization: Highlights main topics.
Sentiment Analysis: Assesses sentiment in documents.
Question Answering: Extracts information based on queries.
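
A compact LDA sketch with scikit-learn; the handful of short posts and the choice of two topics are purely illustrative assumptions.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    posts = [
        "new phone battery lasts all day",
        "battery drains fast after the phone update",
        "great goal in the second half of the match",
        "the team scored a late goal to win the match",
    ]

    # Word-count matrix of the corpus (English stop words removed).
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(posts)

    # Fit LDA with two latent topics and print the top words per topic.
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    terms = vectorizer.get_feature_names_out()
    for topic_id, weights in enumerate(lda.components_):
        top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
        print(f"Topic {topic_id}: {', '.join(top_terms)}")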

Benefits of Topic Modeling


Increased Efficiency: Streamlines processing of large text datasets.
Improved Decision-Making: Provides insights that enhance business strategies.
Reduced Risk: Helps in fraud detection and regulatory compliance.

Named Entity Recognition (NER)


Definition: A subtask of NLP that identifies and classifies named entities in text into categories
like names, organizations, and locations.

Methods:
Rule-based Systems: Use hand-crafted rules but are hard to maintain.
Statistical Systems: Employ machine learning for better accuracy.
Hybrid Systems: Combine both approaches for improved performance.
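
A short sketch of statistical NER, assuming spaCy and its small English model (en_core_web_sm) are installed; the example sentence is invented.

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = "Ada Lovelace joined Microsoft's Lagos office in March to lead the analytics team."

    doc = nlp(text)
    for ent in doc.ents:
        # Each entity carries its surface text and a category such as PERSON, ORG, GPE, or DATE.
        print(f"{ent.text:20s} {ent.label_}")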

Challenges in NER
Ambiguity: Context-dependent meanings can confuse classification.
Abbreviations and Variations: Named entities can be abbreviated or vary morphologically,
complicating identification.
Domain-Specific Language: Specific terms can challenge model effectiveness.

Text Classification
Definition: Categorizing text into predefined classes for applications like spam detection and
sentiment classification.
Common Algorithms: Naive Bayes, Support Vector Machines (SVM), CNNs, and RNNs.
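
A minimal spam-detection sketch with scikit-learn that pairs TF-IDF features with a Naive Bayes classifier; the tiny labeled dataset is fabricated, so the output is only illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = [
        "Win a free prize now, click here",
        "Limited offer, claim your reward today",
        "Meeting moved to 3pm, see agenda attached",
        "Can you review the quarterly report draft?",
    ]
    labels = ["spam", "spam", "ham", "ham"]

    # TF-IDF features feed a Naive Bayes classifier inside a single pipeline.
    classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
    classifier.fit(texts, labels)

    print(classifier.predict(["Claim your free reward now", "Agenda for tomorrow's meeting"]))
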
Challenges in Text Classification
Data Scarcity and Imbalance: Lack of labeled data and uneven class distributions can hinder
model performance.
Ambiguity and Domain-Specific Language: Words with multiple meanings or specialized
terminology complicate classification.

Trend Analysis
Definition: Analyzes social media text to identify and track trends over time.
Visualization Tools: Use of word clouds and time series plots to represent data trends.

Value of Trend Analysis


Identifies Opportunities: Helps businesses find new marketing avenues.
Tracks Customer Sentiment: Provides insights into public perceptions.
Monitors Brand Reputation: Detects negative trends affecting brands.

Text Summarization
Definition: Condensing text while retaining key information.
Types:
Extractive Summarization: Pulls important sentences directly from the text.
Abstractive Summarization: Generates new text summarizing the document's meaning.
Benefits of Text Summarization
Increased Efficiency: Saves time by summarizing long documents.
Improved Comprehension: Highlights essential information for better understanding.
Enhanced Creativity: Provides new perspectives for idea generation.
Research Gaps in Social Media Mining
Scalability: Difficulty in processing large datasets as social media grows.
Heterogeneity: Challenges in integrating various data formats (text, images, videos).
Real-Time Processing: Need for efficient algorithms to handle continuously generated data.
Privacy and Security: Importance of protecting sensitive user information.
Explainability: Difficulty in understanding algorithmic decisions affects trust in results.
