Business Intelligence Overview
Context
Decision-making at all organizational levels impacts performance.
Decisions can be strategic or tactical; collectively, they define daily operations.
Better decision-making leads to enhanced efficiency, profitability, and customer satisfaction;
supporting it is the core function of Business Intelligence (BI).
History of BI
Early computers were designed for specific calculations, leading to the development of
information storage.
Post-1940s, mass storage and Database Management Systems (DBMS) evolved to handle
growing data.
Businesses recognized the analytical value of data, leading to the creation of data warehouses
for unified data views.
Transactional systems, like Point-of-Sale (POS), were developed for daily operations, followed
by decision support systems and, later, BI tools such as Microsoft Power BI and Tableau.
Definition of Business Intelligence
BI refers to software that processes business data into user-friendly formats (reports,
dashboards).
It encompasses various data types and helps organizations gain insights for decision-making.
BI enables companies to examine trends and make informed predictions.
Key Functions of BI
Data Conversion: Raw data → Information → Insights → Decisions.
BI Big Four:
Accurate Answers: Must reflect organizational reality.
Valuable Insights: Impactful information for organizational success.
Timely Information: Current data for effective decision-making.
Actionable Conclusions: Insights should lead to concrete actions.
BI Value Proposition
BI empowers organizations to ask critical questions and make data-driven decisions.
It provides insights into past and current performance, aiding strategic planning and resource
allocation.
BI enhances customer understanding, operational monitoring, and supply chain management.
BI Lifecycle
Planning: Define goals, identify data sources, select tools.
Data Acquisition: Collect, clean, and integrate data.
Data Modeling: Create structures to manage BI data.
Data Analysis: Identify trends and share insights.
Reporting and Visualization: Present data clearly for decision-making.
Deployment and Maintenance: Implement solutions and ensure data accuracy.
Big Data and BI Nexus
Big Data Characteristics: Volume, Velocity, Variety, Veracity, Value.
Big Data provides the raw material; BI processes convert this data into actionable insights.
BI tools help identify trends, predict future events, and improve customer service.
Types of Analytics
Descriptive: What happened?
Predictive: What might happen?
Prescriptive: What should be done?
Diagnostic: Why did it happen?
Data Lake
Overview: A centralized repository for storing large volumes of raw data (structured,
semi-structured, and unstructured) without predefined schemas.
Flexibility: Data lakes can ingest data from various sources in real-time or batch mode and
support multiple languages for analysis.
Multidimensional Data Model
Dimensions: Perspectives or entities for recording data (e.g., time, item, location). Each has an
associated dimension table with descriptive attributes.
Facts: Numeric measures (e.g., dollars sold, units sold) stored in a central fact table, linked to
dimension tables via keys.
Cuboids: Represent different levels of data summarization, forming a lattice of cuboids that
constitute a data cube. The base cuboid shows the lowest level of detail, while the apex cuboid
presents the highest level of summarization.
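The lattice of cuboids can be enumerated with a short sketch. It assumes the three example dimensions named above (time, item, location) and no concept hierarchies, in which case n dimensions give 2^n cuboids. The function name `cuboid_lattice` is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return every cuboid as the tuple of dimensions it groups by."""
    lattice = []
    # k is the number of dimensions kept: k = len(dimensions) is the base
    # cuboid (full detail), k = 0 is the apex cuboid (everything summarized).
    for k in range(len(dimensions), -1, -1):
        lattice.extend(combinations(dimensions, k))
    return lattice

dims = ("time", "item", "location")
lattice = cuboid_lattice(dims)
for cuboid in lattice:
    print(f"{len(cuboid)}-D cuboid:", cuboid if cuboid else "(apex)")
```

With hierarchies on a dimension (for example day, month, year), the number of cuboids grows beyond 2^n; this sketch does not cover that case.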
Star Schema:
Features a central fact table with associated dimension tables in a radial pattern.
Simple design with denormalized dimension tables: some redundancy is accepted in exchange for
fewer joins, which keeps queries simple and fast.
Snowflake Schema:
A variant of the star schema with normalized dimension tables, which reduces redundancy but
complicates queries due to additional joins.
Useful for maintaining data but can affect browsing efficiency.
Fact Constellation Schema:
Allows multiple fact tables to share dimension tables, resembling a galaxy or collection of stars.
Useful for complex applications requiring detailed analytics across different measures (e.g.,
sales and shipping).
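As an illustration of the star layout, the sketch below builds a tiny star schema in an in-memory SQLite database and answers one question by joining the fact table to a dimension table. Table names, column names, and all figures are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Central fact table: numeric measures plus one key per dimension.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item VALUES (1, 'phone'), (2, 'laptop');
INSERT INTO fact_sales VALUES
    (1, 1, 10, 5000.0),
    (1, 2, 4, 4800.0),
    (2, 1, 7, 3500.0);
""")

# Dollars sold per quarter: one join from the fact table to dim_time.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_key = f.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
print(rows)
```

In a snowflake variant, `dim_item` might be split further (for example into item and category tables), which would add one more join to queries that group by category.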
Meaning of OLAP
OLAP Definition: On-Line Analytical Processing (OLAP) enables users to generate descriptive and
comparative summaries of multidimensional data and to analyze that data interactively. The term
was coined by E. F. (Ted) Codd in 1993.
Contrast with OLTP: OLAP systems are designed for analysis rather than for transaction
processing (OLTP). Analytical systems of this kind were historically called Decision Support
Systems (DSS).
Benefits of OLAP
Consistent Reporting: OLAP ensures steady calculations and coherent reporting, allowing
managers to analyze data both broadly and in detail.
User Independence: Empowers business users to build models independently while maintaining
data integrity.
Efficiency: Reduces query load on transaction systems, leading to more efficient operations and
faster market responsiveness, which can enhance revenue and profitability.
OLAP Functionalities
Roll-up: Aggregates data by climbing up a hierarchy.
Drill-down: Provides more detailed data by stepping down a hierarchy or adding dimensions.
Slice and Dice: A slice fixes a single value on one dimension; a dice selects values on two or
more dimensions. Both yield a sub-cube.
Pivot: Rotates data axes for different perspectives.
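The four operations can be sketched on a toy cube held in a plain dictionary. This is only an illustration with made-up data; the helper names `aggregate` and `select` are hypothetical, and a real OLAP server would work on pre-built cube structures instead.

```python
from collections import defaultdict

DIMS = ("year", "quarter", "location", "item")

# Each cell maps one value per dimension to a measure (units sold, made up).
cube = {
    (2023, "Q1", "North", "phone"): 10,
    (2023, "Q1", "South", "phone"): 6,
    (2023, "Q2", "North", "laptop"): 4,
    (2024, "Q1", "North", "phone"): 8,
}

def aggregate(cells, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def select(cells, **allowed):
    """Keep only cells whose dimension values are in the allowed sets."""
    return {
        key: value for key, value in cells.items()
        if all(key[DIMS.index(d)] in values for d, values in allowed.items())
    }

by_quarter = aggregate(cube, ("year", "quarter"))       # drill-down view
by_year = aggregate(cube, ("year",))                    # roll-up: quarter -> year
slice_2023 = select(cube, year={2023})                  # slice: one dimension fixed
dice = select(cube, year={2023}, location={"North"})    # dice: two dimensions
pivoted = aggregate(cube, ("location", "year"))         # pivot: axes swapped

print(by_year, by_quarter, pivoted, sep="\n")
```

Going from `by_year` to `by_quarter` is the drill-down direction; the reverse is the roll-up.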
OLAP Server
Definition: A high-capacity, multi-user data manipulation engine designed for multidimensional
data structures, enabling efficient data access.
Types of OLAP
Relational OLAP (ROLAP): Uses relational databases, providing aggregation navigation and tools
while maintaining data in relational tables.
Multidimensional OLAP (MOLAP): Stores data in multidimensional cubes for fast query
performance, with data pre-summarized.
Hybrid OLAP (HOLAP): Combines features of ROLAP and MOLAP, allowing data to be stored in
both relational databases and multidimensional formats.
Web OLAP (WOLAP): Accessible via web browsers, offering lower investment and easier
deployment but with limited functionality compared to traditional OLAP.
Desktop OLAP (DOLAP): Users can download and work with data locally, offering ease of use at
a lower cost but with limited features.
Mobile OLAP: Provides OLAP functionalities on mobile devices, enabling remote data access. It is
left unabbreviated here because MOLAP already denotes Multidimensional OLAP.
Spatial OLAP (SOLAP): Integrates GIS capabilities with OLAP, managing both spatial and
non-spatial data for enhanced exploration.
Summary of Data Mining
Introduction
Definition: Data mining involves discovering patterns and insights from large datasets using
various techniques, crucial for decision-making, prediction, and optimization.
Applications: Common uses include customer profiling, targeting competitive customers, and
market-basket analysis for cross-selling strategies.
Exploratory Data Analysis: Data mining is often described as exploratory data analysis; it
combines classical statistical methods with automated AI techniques.
1. Classification
Definition: Assigns data points to predefined categories based on labeled training data.
Examples: Email spam detection, credit risk assessment, and disease diagnosis.
Scenario: Predicting fraudulent transactions using algorithms like decision trees and neural
networks.
2. Clustering
Definition: Groups similar data points into clusters without predefined categories.
Examples: Segmenting customers based on demographics and buying behavior.
Scenario: Using k-means or hierarchical clustering to find natural groupings among customers.
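A minimal k-means sketch for the customer scenario follows. The customer data is invented, and taking the first k points as initial centroids is an assumption made here to keep the run deterministic; practical implementations usually use random or k-means++ initialization.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    centroids = list(points[:k])  # assumption: first k points seed the clusters
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# (age, monthly spend) for six hypothetical customers
customers = [(22, 150), (25, 180), (24, 160), (52, 900), (55, 950), (58, 880)]
centroids, labels = kmeans(customers, k=2)
print(labels, centroids)
```

Because the two features are on different scales, spend dominates the distance here; scaling the features first would normally be advisable.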
3. Regression
Definition: Maps data items to real-valued outcomes, identifying the best-fitting function (e.g.,
linear).
Example: A professor predicts retirement savings using linear regression based on past values.
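The retirement-savings example can be sketched with ordinary least squares for a single predictor. The savings figures are hypothetical, and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [1, 2, 3, 4, 5]
savings = [12.0, 14.0, 16.0, 18.0, 20.0]  # hypothetical balances, in thousands
slope, intercept = fit_line(years, savings)
predicted_year_8 = slope * 8 + intercept
print(slope, intercept, predicted_year_8)
```

Extrapolating to year 8 assumes the linear trend continues, which is exactly the assumption a linear model encodes.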
4. Time Series Analysis
Definition: Examines how attribute values change over time, typically at regular intervals.
Functions: Determine similarity between time series, classify their behavior, and predict future
values.
Scenario: Charting stock prices to decide on investments based on volatility and growth.
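For the stock-price scenario, a small sketch can compute a moving-average trend, overall growth, and a simple volatility figure. The prices are invented, and using the population standard deviation of period returns as "volatility" is one common convention, assumed here.

```python
from statistics import pstdev

def moving_average(series, window):
    """Mean of each run of `window` consecutive observations."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]   # hypothetical daily closes
trend = moving_average(prices, window=3)         # smooths short-term noise
returns = [b / a - 1 for a, b in zip(prices, prices[1:])]
growth = prices[-1] / prices[0] - 1              # total growth over the period
volatility = pstdev(returns)                     # spread of period returns

print(trend, growth, volatility)
```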
5. Prediction
Definition: Forecasts future data states based on historical data, often using classification
methods.
Examples: Flood prediction using river monitoring data.
Scenario: Predicting water levels from upstream sensors.
6. Summarization
Definition: Maps data into subsets with simple descriptions to provide insights.
Example: A financial statement summarizing a company's performance, including revenue and
expenses.
7. Association Rules
Definition: Uncovers relationships among data, often used in retail to identify items frequently
purchased together.
Example: A grocery store analyzes purchase data to determine that bread is often bought with
pretzels and jelly, guiding marketing strategies.
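The strength of a rule such as "bread → jelly" is usually measured by support and confidence. The sketch below computes both on five invented baskets; the helper names are hypothetical.

```python
baskets = [
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "pretzels"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))                  # share of baskets with bread
print(support({"bread", "jelly"}))         # share with both items
print(confidence({"bread"}, {"jelly"}))    # how often jelly accompanies bread
```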
8. Sequence Discovery
Definition: Identifies patterns based on the sequential order of events over time.
Example: Analyzing web traffic to find common navigation paths among users, leading to
changes in website structure.
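A simplified sketch of the web-traffic example: it counts contiguous page sequences of a fixed length and keeps those seen in enough sessions. Restricting to contiguous paths is an assumption made for brevity; general sequential pattern mining also allows gaps between events. Session data and the name `frequent_paths` are made up.

```python
from collections import Counter

sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart"],
    ["search", "product", "cart", "checkout"],
]

def frequent_paths(sessions, length, min_sessions):
    """Count contiguous page paths, once per session that contains them."""
    counts = Counter()
    for pages in sessions:
        # A set so a path repeated within one session is counted only once.
        seen = {tuple(pages[i:i + length])
                for i in range(len(pages) - length + 1)}
        counts.update(seen)
    return {path: n for path, n in counts.items() if n >= min_sessions}

paths = frequent_paths(sessions, length=2, min_sessions=2)
print(paths)
```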
Algorithms
Apriori Algorithm:
Generates frequent itemsets using a minimum support threshold; confidence is applied later, when
rules are derived from those itemsets.
Scans the dataset to identify frequent single items and builds larger itemsets by combining
them.
Prunes infrequent candidates based on the "Apriori property," which states that all supersets of
an infrequent itemset are also infrequent.
Generates association rules from discovered frequent itemsets.
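The frequent-itemset part of the steps above can be sketched as follows. This is a compact illustration, not an optimized implementation: `min_support` is an absolute count, the baskets are invented, and rule generation is left out.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all frequent itemsets."""
    def count(candidates):
        # One pass over the data per candidate level.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate that has an
        # infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset(t) for t in (
    {"bread", "jelly"}, {"bread", "pretzels"},
    {"bread", "jelly", "pretzels"}, {"milk"}, {"bread", "milk"},
)]
frequent = apriori(transactions, min_support=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this data the candidate {bread, jelly, pretzels} is pruned without being counted, because its subset {jelly, pretzels} is infrequent.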
FP-Growth Algorithm:
An efficient alternative to Apriori that constructs a compact FP-tree, reducing database scans.
Builds the FP-tree from the transaction database and mines frequent itemsets directly from the
tree.
Avoids candidate generation, making it faster and more suitable for large datasets.
Implementation Example
FP-Growth Steps:
Build the FP-tree based on item frequencies.
Mine frequent itemsets from the FP-tree.
Generate association rules by evaluating confidence and pruning those that do not meet the
threshold.
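The first two steps can be sketched as below: build an FP-tree, then mine it recursively through conditional pattern bases. This is a teaching sketch under stated assumptions (absolute `min_support`, ties in item frequency broken alphabetically, invented baskets); the names `Node`, `build_tree`, and `fp_growth` are hypothetical, and rule generation is omitted.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(transactions, min_support):
    """transactions: list of (items, count). Returns (header table, item counts)."""
    freq = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            freq[item] += count
    freq = {i: n for i, n in freq.items() if n >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, count in transactions:
        # Keep frequent items only, most frequent first, so that common
        # prefixes share tree nodes.
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, freq

def fp_growth(transactions, min_support, suffix=frozenset()):
    """Return {frozenset: support_count}, without candidate generation."""
    header, freq = build_tree(transactions, min_support)
    result = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        result[itemset] = freq[item]
        # Conditional pattern base: the prefix path above each `item` node,
        # weighted by that node's count.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        result.update(fp_growth(conditional, min_support, itemset))
    return result

baskets = [({"bread", "jelly"}, 1), ({"bread", "pretzels"}, 1),
           ({"bread", "jelly", "pretzels"}, 1), ({"milk"}, 1),
           ({"bread", "milk"}, 1)]
mined = fp_growth(baskets, min_support=2)
print({tuple(sorted(k)): v for k, v in mined.items()})
```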
Time Series Data
Time series data consists of observations collected over discrete time intervals, recorded
chronologically.
It helps analyze trends, patterns, and seasonality, aiding in forecasting and decision-making
across various applications.
Key Terms
Web Clickstreams:
Sequence mining analyzes user navigation patterns on websites by examining the order of web
pages visited.
Insights can reveal user preferences and navigation bottlenecks.
For instance, it may be found that users follow a specific page sequence before making a
purchase.
Healthcare:
Sequence mining aids in analyzing patient treatment patterns to identify effective strategies.
By examining sequences of medical procedures, medications, and diagnoses, healthcare
providers can identify common paths leading to successful outcomes or adverse events.
This information helps in creating personalized treatment plans and improving patient care.
Sentiment Analysis
A critical part of text analytics, sentiment analysis classifies text as positive, negative, or neutral,
using machine learning and rule-based methods.
Challenges include ambiguity in language, subjectivity of opinions, and potential biases in
models.
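A rule-based approach can be sketched with a tiny word list and one negation rule. The lexicons below are invented and far too small for real use; the sketch also shows why ambiguity is a challenge, since anything outside the word lists is silently treated as neutral.

```python
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Classify text as 'positive', 'negative', or 'neutral' by word counts."""
    score, negate = 0, False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True          # flip the polarity of the next word only
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ("I love this phone!",
               "The service was not good.",
               "It arrived on Monday."):
    print(review, "->", sentiment(review))
```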
Applications:
Document Clustering: Groups related documents.
Text Summarization: Highlights main topics.
Sentiment Analysis: Assesses sentiment in documents.
Question Answering: Extracts information based on queries.
Methods:
Rule-based Systems: Use hand-crafted rules but are hard to maintain.
Statistical Systems: Employ machine learning for better accuracy.
Hybrid Systems: Combine both approaches for improved performance.
Challenges in Named Entity Recognition (NER)
Ambiguity: Context-dependent meanings can confuse classification.
Abbreviations and Variations: Named entities can be abbreviated or vary morphologically,
complicating identification.
Domain-Specific Language: Specific terms can challenge model effectiveness.
Text Classification
Definition: Categorizing text into predefined classes for applications like spam detection and
sentiment classification.
Common Algorithms: Naive Bayes, Support Vector Machines (SVM), CNNs, and RNNs.
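Of the algorithms listed, Naive Bayes is the simplest to sketch. The version below is a multinomial Naive Bayes with add-one (Laplace) smoothing for spam detection; the four training messages are invented, tokenization is plain whitespace splitting, and words never seen in training are ignored, all of which are simplifying assumptions.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns (class counts, word counts, vocab)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # log prior
        for word in text.lower().split():
            if word in vocab:
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
model = train(training)
print(predict(model, "free prize offer"))
print(predict(model, "monday project meeting"))
```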
Challenges in Text Classification
Data Scarcity and Imbalance: Lack of labeled data and uneven class distributions can hinder
model performance.
Ambiguity and Domain-Specific Language: Words with multiple meanings or specialized
terminology complicate classification.
Trend Analysis
Definition: Analyzes social media text to identify and track trends over time.
Visualization Tools: Use of word clouds and time series plots to represent data trends.
Text Summarization
Definition: Condensing text while retaining key information.
Types:
Extractive Summarization: Pulls important sentences directly from the text.
Abstractive Summarization: Generates new text summarizing the document's meaning.
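Extractive summarization can be sketched with a word-frequency heuristic: score each sentence by the average frequency of its non-stopword terms and keep the top ones. The stopword list, the scoring rule, and the sample text are all assumptions chosen for illustration; abstractive summarization needs a generative model and is not shown.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on"}

def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

report = ("Sales grew in the north region. "
          "Sales grew in the south region too. "
          "The office party is on Friday.")
summary = summarize(report)
print(summary)
```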
Benefits of Text Summarization
Increased Efficiency: Saves time by summarizing long documents.
Improved Comprehension: Highlights essential information for better understanding.
Enhanced Creativity: Provides new perspectives for idea generation.
Research Gaps in Social Media Mining
Scalability: Difficulty in processing large datasets as social media grows.
Heterogeneity: Challenges in integrating various data formats (text, images, videos).
Real-Time Processing: Need for efficient algorithms to handle continuously generated data.
Privacy and Security: Importance of protecting sensitive user information.
Explainability: Difficulty in understanding algorithmic decisions affects trust in results.