Complete Notes of BA
Unit - I (Introduction)
Analytics
Analytics is the process of discovering, interpreting, and communicating meaningful patterns in data. It also denotes a person’s skill to gather and use data to generate insights that lead to fact-based decision-making.
Data-driven analytics provides unparalleled opportunities to transform vast areas such as business, healthcare, and government. The application of data-driven analytics is especially valuable in areas rich with recorded information. Analytics relies on the simultaneous application of statistics, computer programming, and operations research to measure performance, and it typically favours data visualization when communicating insight. Analytics also helps organizations use the business data they generate: it enables them to describe, predict, and improve their business performance. Business analytics makes extensive use of statistical analysis, including explanatory and predictive modelling, and fact-based management to drive decision-making. It is therefore closely related to management science. Analytics may be used as input for human decisions or may drive fully automated decisions. Business analytics can answer questions such as: Why is this happening? What if these trends continue? What will happen next (prediction)? What is the best that can happen (optimization)?
A career in business analytics is a popular choice among those who enjoy working with numbers. To start
working towards a career in BA, you’ll need a bachelor’s degree in business analytics, data science,
information management, business intelligence, marketing, statistics, or a related field.
Some of the more popular career paths related to business analytics are:
• Data Analyst or Data Scientist: As a data scientist, you would collect, analyse, and organize data in a
way that provides the organization with valuable insights that can be utilized by all departments. A
data analyst presents this data to upper management using tables, charts, and other types of reports.
• Business Intelligence Analyst: A business intelligence analyst differs in that they gather and analyse information to gain an advantage over competing organizations. They’ll
present to upper management exactly where their business stands, its strengths and weaknesses, and
how they can bring in a larger profit.
• Big Data Analytics Specialist: Using the latest developments in technology and data science, big data
analytics specialists solve challenges that arise when working within a digital industry. They will often
be asked to weigh in on various decisions using insights gained through data and need to be able to
back up their conclusions with factual evidence.
• Management Analyst or Consultant: The role of a management analyst consists of working with business
operations and making sure they’re running smoothly and effectively. You’ll work with several other
departments to narrow down which business process needs to be improved while also finding a way
to enhance efficiency.
• Marketing Manager: Those who choose the route of a marketing manager will be required to come up
with the marketing strategies of the organization. Whether that’s overseeing marketing campaigns,
gathering retail analytics, working directly with the sales and marketing teams, or reporting to upper
management, it’ll likely depend on the type of organization and industry.
• Operations Research Analyst: Operations research analysts analyze operational data using information technology to run analyses and develop solutions that improve efficiency across various
departments.
• Market Research Analyst: Those who choose to be a market research analyst will work directly with
marketing data. This type of information helps to identify potential customers and evaluate the desirability of a product or service.
Types of Analytics
• Decision analytics: supports human decisions with visual analytics that the user models to reflect reasoning
• Descriptive analytics: gains insight from historical data with reporting, scorecards, clustering etc.
• Predictive analytics: employs predictive modelling using statistical and machine learning techniques
Data Analysis is a broad spectrum that includes Analysis of all kinds, on data sets of all sizes. At a basic level,
working with functions and formatting data in Microsoft Excel is an example of Analysis. Excel was the tool
largely used by businesses for a long time. But as the volume of data grew, Excel could no longer be relied upon as a do-it-all tool. Analysis tools had to be scaled to fit “bigger” data as well, so new tools were developed to deal with Big Data; this led to the birth of Hadoop. Analysis largely deals with analysing past
data and understanding the data. Analytics deals with using these insights to make smart business decisions in
the future.
The integration of artificial intelligence (AI) and machine learning (ML) has revolutionized business analytics.
These technologies enhance the ability to analyze massive datasets, automate decision-making processes, and
provide real-time insights.
The evolution of data analytics brought forth advanced techniques that enabled organizations to go beyond
descriptive analytics and delve into predictive and prescriptive analytics. Predictive analytics leveraged
historical data and statistical models to forecast future outcomes, enabling proactive decision-making.
The historical development of business and its processes up to the present is called the evolution of business. The development of industry can be traced through five stages: the handicraft stage, the age of guilds, the domestic system, the industrial revolution, and the present stage.
Analytics process
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, informing conclusions, and supporting decision-making.
• Establish a goal. First, determine the purpose and key objectives of your data analysis.
• Determine the type of data analytics to use.
• Determine a plan to produce the data.
• Collect the data.
• Clean the data.
• Evaluate the data.
• Visualize the data.
• Perform descriptive analysis (a minimal R sketch of these steps follows this list).
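A minimal sketch of these steps in R, assuming a hypothetical file sales.csv with columns region and revenue (both the file and its columns are made up for illustration):

# Collect the data (hypothetical file and column names)
sales <- read.csv("sales.csv")
# Clean the data: drop rows with missing values
sales <- na.omit(sales)
# Evaluate the data: basic summary statistics
summary(sales$revenue)
# Visualize the data: revenue by region
boxplot(revenue ~ region, data = sales, main = "Revenue by region")
# Descriptive analysis: average revenue per region
aggregate(revenue ~ region, data = sales, FUN = mean)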
In the world of big data and analytics, there are three key roles that are essential to any data-driven
organization: data scientist, data engineer, and data analyst. While the job titles may sound similar, there are
significant differences between the roles. In this article, we will explore the differences between data scientist,
data engineer, and data analyst, and how each of these roles contributes to the overall success of a data-driven
organization.
Generally, we hear different designations for computer science engineers, such as Data Scientist, Data Analyst, and Data Engineer. Let us discuss the differences between these three roles.
Data Analyst –
The main focus of this role is the optimization of scenarios, for example how an employee can improve the company’s product growth. The work involves cleaning and organizing raw data, then analyzing and visualizing it to interpret the analysis and present it in technical form. Skills needed for a Data Analyst include R, Python, SQL, SAS, and SAS Miner. A data analyst is responsible for collecting, organizing, and analyzing data
to identify patterns and insights that can be used to make data-driven decisions. Data analysts work with
structured data, such as spreadsheets and databases, and are responsible for creating reports and dashboards
that communicate key insights to stakeholders.
Key Responsibilities of a Data Analyst:
• Collecting and cleaning structured data sets
• Creating reports and dashboards to communicate key insights to stakeholders
• Identifying patterns and trends in data to drive business decisions
• Collaborating with data scientists and data engineers to ensure data quality and consistency
• Staying up-to-date with the latest data analysis tools and techniques
Data Scientist –
The predominant focus is on the future-oriented use of data, that is, prediction. Data scientists apply both supervised and unsupervised learning, such as classification, regression, and neural networks, and use machine learning techniques for continuous regression analysis. Skills needed for a Data Scientist include R, Python, SQL, SAS, Pig, Apache Spark, Hadoop, Java, and Perl. A data scientist is responsible for collecting,
analyzing, and interpreting complex data sets using statistical and machine learning techniques. The data
scientist works with a wide variety of data, including structured, unstructured, and semi-structured data, and
is responsible for finding patterns, trends, and insights that can be used to drive business decisions.
Key Responsibilities of a Data Scientist:
• Collecting and cleaning large data sets
• Building predictive models using statistical and machine learning techniques
• Communicating insights and recommendations to stakeholders
• Developing data visualizations to communicate complex data in a simple manner
• Collaborating with data engineers to ensure data is accurate and consistent
• Staying up-to-date with the latest data science techniques and technologies
Data Engineer –
Data engineers concentrate on optimization techniques and on building data in a proper manner. The main aim of a data engineer is to continuously improve data consumption. A data engineer mainly works at the back end, using optimized machine learning algorithms to maintain data and make it available in the most accurate manner. Skills needed for a Data Engineer include Pig, Hive, Hadoop, and MapReduce
techniques. A data engineer is responsible for designing and implementing the infrastructure and tools needed
to collect, store, and process large amounts of data. Data engineers work with a wide variety of data storage
technologies, such as Hadoop, NoSQL, and SQL databases, and are responsible for ensuring the data is
accurate, consistent, and available for analysis.
Key Responsibilities of a Data Engineer:
• Designing and implementing data pipelines to collect and process large amounts of data
• Managing and optimizing data storage technologies such as Hadoop, NoSQL, and SQL databases
• Building and maintaining data warehouses and data lakes
• Ensuring data quality and consistency across multiple sources
• Working with data scientists to ensure the accuracy and consistency of the data used for analysis
• Staying up-to-date with the latest data storage technologies and best practices
Data Scientist: provides supervised/unsupervised learning of data and classifies and regresses data; data scientists rely heavily on neural networks and machine learning for continuous regression analysis.
Data Engineer: builds data in an appropriate format and works at the back end; a data engineer uses optimized machine learning algorithms to maintain data and make it available in the most appropriate manner.
Data Analyst: performs data cleaning, organizes raw data, and analyzes and visualizes data to interpret the analysis.
For example, data scientists typically have a stronger background in statistics and machine learning, while
data analysts typically have a stronger background in mathematics and business. Data engineers typically have
a stronger background in computer science and engineering.
Introduction of R
R is an open-source programming language that is widely used as a statistical software and data analysis tool.
R generally comes with a command-line interface. R is available across widely used platforms like Windows, Linux, and macOS. The R programming language is also regarded as a cutting-edge tool for data analysis.
Concept of R
R is a programming language and a software environment for statistical computing and graphics. Microsoft R
Open is a version of R that was created by the Microsoft Corporation. Both R and Microsoft R Open are free
and open-source tools for data science and analytics.
Main role of R
R is widely used in data science by statisticians and data miners for data analysis and the development of
statistical software. R is one of the most comprehensive statistical programming languages available, capable
of handling everything from data manipulation and visualization to statistical analysis.
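As a small illustration of this role, the snippet below uses R's built-in mtcars dataset for data manipulation, a couple of statistics, and a plot:

# Inspect, summarize and plot R's built-in mtcars dataset
data(mtcars)
head(mtcars)                      # data manipulation: inspect the first rows
mean(mtcars$mpg)                  # statistical analysis: average fuel economy
cor(mtcars$mpg, mtcars$wt)        # correlation between mileage and weight
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")  # visualization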
----------------------------------------------------------------------------------------------------------------------------------
A data warehouse is a central repository of information that can be analyzed to make more informed decisions.
Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically
on a regular cadence.
Data warehousing is a method of organizing and compiling data into one database, whereas data mining deals
with fetching important data from databases. Data mining attempts to depict meaningful patterns through a
dependency on the data that is compiled in the data warehouse.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. A data warehouse is kept separate from the organization's operational database, and it can be said that a data warehouse is a more extensive form of DBMS data.
ETL
ETL stands for extract, transform, and load, and ETL tools move data between systems. If ETL were for
people instead of data, it would be akin to public and private transportation. Companies use ETL to safely and
reliably move their data from one system to another.
ETL Process Example: Extracting, Transforming, and Loading Data from a Retail Database to a Data
Warehouse. A use case example of an ETL process would be a retail company that is looking to improve data
management and analyse sales data from various store locations.
Extract
During data extraction, raw data is copied or exported from source locations to a staging area. Data
management teams can extract data from a variety of data sources, which can be structured or unstructured.
Those sources include but are not limited to:
• SQL
• CRM and ERP systems
• Flat files
• Email
• Web pages
Transform
In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case. This phase can involve tasks such as:
• Filtering, cleansing, and de-duplicating the data
• Converting data types and standardizing formats
• Performing calculations, aggregations, or summarizations
• Applying business rules and validations
A small R sketch of the whole ETL flow is given below.
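The sketch below illustrates the retail ETL example; the file names, column names, and output location are hypothetical, and the load step is simplified to writing a cleaned file rather than loading a real warehouse:

# Extract: copy raw data from hypothetical store files to a staging data frame
store_a <- read.csv("store_a_sales.csv")
store_b <- read.csv("store_b_sales.csv")
sales <- rbind(store_a, store_b)                 # consolidate the sources
# Transform: clean and standardize in the staging area
sales <- na.omit(sales)                          # remove incomplete records
sales$sale_date <- as.Date(sales$sale_date)      # standardize the date format
sales$amount <- round(sales$amount, 2)           # normalize currency values
# Load: write the prepared data to the warehouse area (simplified here)
write.csv(sales, "warehouse/fact_sales.csv", row.names = FALSE)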
Star Schema
A star schema is a multi-dimensional data model used to organize data in a database so that it is easy to
understand and analyze. Star schemas can be applied to data warehouses, databases, data marts, and other
tools. The star schema design is optimized for querying large data sets.
• Better-performing queries: by removing the bottlenecks of a highly normalized schema, query speed increases and the performance of read-only commands improves.
• Provides data to OLAP systems: OLAP (Online Analytical Processing) systems can use star schemas to build OLAP cubes.
Data mining is the process of analysing a large batch of information to discern trends and patterns. Data mining
can be used by corporations for everything from learning about what customers are interested in or want to
buy to fraud detection and spam filtering.
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data.
Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns
and extract useful information. Companies use data mining software to learn more about their customers. It
can help them to develop more effective marketing strategies, increase sales, and decrease costs.
Data mining functionality describes the kinds of tasks that must be carried out. It includes tasks such as classification, clustering, discrimination, characterization, association, and evolution analysis.
Data mining techniques
• Classification analysis. This analysis is used to retrieve important and relevant information about data,
and metadata.
• Association rule learning.
• Anomaly or outlier detection.
• Clustering analysis (a minimal R sketch of clustering follows this list).
• Regression analysis.
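As a minimal sketch of one of these techniques, the snippet below runs k-means clustering on R's built-in iris data; the choice of three clusters is purely illustrative:

# Clustering analysis with k-means on the built-in iris data
data(iris)
features <- iris[, 1:4]                   # keep the numeric measurements only
set.seed(42)                              # make the result reproducible
clusters <- kmeans(features, centers = 3) # three clusters, chosen for illustration
table(clusters$cluster, iris$Species)     # compare clusters with the known species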
Data mining for retail industry, health industry, insurance and telecommunication sector
Retail sector
Those in the retail industry can use data mining software to categorize customers based on shared
characteristics. These subgroups may be divided by demographics, shopping behaviours, past purchase
history, etc. Retailers can use the data collected to reach their target customers more effectively.
Retailers use data analytics to improve inventory management, marketing efforts, pricing, and product
allocations. Retail analytics involves using software to collect and analyse data from physical, online, and catalogue outlets to provide retailers with insights into customer behaviour and shopping trends.
Health sector
Data mining helps the healthcare providers to identify the present and future requirements of patients and their
preferences to enhance their satisfaction levels. Large amounts of data are collected with the advancement of electronic health records.
Healthcare data mining is useful in aspects of predictive medicine, customer relationship handling, and
assessing the efficiency of treatments.
Insurance sectors
Data mining can help insurance companies improve their decision making, risk management, customer
segmentation, fraud detection, and product development.
By analysing historical claims data and utilizing predictive modelling, insurers can automate and streamline
the claims assessment process. This leads to faster claims resolution, reduced administrative costs, and
improved customer satisfaction.
Telecommunication sectors
Numerous data mining applications have been deployed in the telecommunications industry. However, most
applications fall into one of the following three categories: marketing, fraud detection, and network fault
isolation and prediction.
Process mining helps telecommunication companies identify anomalies and predict fraudulent activities by
monitoring user behavior and operational workflows. With process mining, communication organizations can ensure IoT security and prevent data leaks.
Unit - III
Data visualization is the process of using visual elements like charts, graphs, or maps to represent data. It
translates complex, high-volume, or numerical data into a visual representation that is easier to process. Data
visualization tools improve and automate the visual communication process for accuracy and detail.
Tables
A table is a systematic arrangement of statistical information in rows and columns. The rows of a table are the
horizontal arrangement of data, whereas the columns of a table are the vertical arrangement of data. A table can be a one-way or a two-way table.
Chart
A chart is a graphical representation for data visualization, in which "the data is represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart".
Cross-tabulation
Cross tabulation is a useful analysis tool commonly used to compare the results for one or more variables with
the results of another variable. It is used with data on a nominal scale, where variables are named or labeled
with no specific order.
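A cross tabulation can be produced in R with table(); the two categorical vectors below are made up for illustration:

# Cross-tabulation of two made-up nominal variables
gender  <- c("M", "F", "F", "M", "F", "M", "F", "F")
product <- c("A", "A", "B", "B", "A", "A", "B", "A")
table(gender, product)   # counts for each combination of categories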
Tableau
Tableau is most known for its wide range of data visualization capabilities, and is often used interchangeably
with other traditional BI tools. Analysts use it to examine data with SQL and build data solutions for business
decision-makers, who in turn use it to analyze data without having to code.
Column Chart
A column chart uses vertical bars to compare values across categories; it is the vertical counterpart of the bar chart.
Line Graph
A line graph is a graph that connects individual data points with lines. A line graph depicts quantitative values
over a given time period.
Bar Graph
A bar chart uses horizontal bars to compare two or more types or things. The categories being compared are shown on one axis and the data values are shown on the other.
Dual Axis Chart
Certain parameters in a business are interconnected with one another. As a result, the relationship between two variables is represented using a dual axis chart.
Pie chart
A pie chart depicts how an entire quantity is divided among levels of a categorical variable, shown as a circle divided into radial slices.
Scatter chart
A scatter plot represents the values of two numerical variables. The position of each dot on the horizontal and vertical axes represents the values for a single data point. Scatter plots are used to investigate the relationship between variables.
Pictorial chart
A pictorial chart employs symbols or images to provide a more visually appealing overall view of subsets of discrete data.
Area chart
An area chart is a hybrid of a line chart and a bar chart. It depicts how the numerical values of one or more groups shift as a second variable, typically time, progresses.
Histogram
A histogram is a graphical representation of a frequency distribution. It is a graph that displays the number of observations within each interval.
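A few of the chart types above can be drawn with base R graphics; the quarterly sales values below are made up for illustration:

# Made-up quarterly sales used for the column, line and pie charts
sales <- c(Q1 = 120, Q2 = 150, Q3 = 90, Q4 = 180)
barplot(sales, main = "Column chart")                 # column chart (vertical bars)
plot(sales, type = "l", main = "Line graph")          # line graph over the quarters
pie(sales, main = "Pie chart")                        # pie chart of the same values
plot(mtcars$wt, mtcars$mpg, main = "Scatter chart",   # scatter chart from built-in data
     xlab = "Weight", ylab = "Miles per gallon")
hist(mtcars$mpg, main = "Histogram")                  # histogram of mpg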
Data modelling is the process of creating a simplified visual diagram of a software system and the data
elements it contains, using text and symbols to represent the data and how it flows. Data models provide a
blueprint to businesses for designing a new database or reengineering a legacy application.
These models help teams to manage data and convert it into valuable business information.
Central Tendency
One of the most important aspects of describing a distribution is the central value around which the observations are distributed. A statistical measure used to represent the centre or central value of a set of observations is known as a measure of central tendency. The central value is also called an average. An average helps to obtain a representative value of the entire mass of data, facilitates comparison, and is useful in decision-making. There are three averages in common use: the mean, the median, and the mode. We can also define measures of central tendency as the statistical measures which tell us the location or position of the central value or central point describing the central tendency of the entire mass of data. The concept of central tendency is used wherever a quantitative variable is involved.
Examples: marks, income, sales, placements, speed, expenditure, production, price, rainfall, etc.
• To get a single value that describes the characteristics of the entire group
• To facilitate comparison
Mean
The mean may be defined as the sum or aggregate of a series of items divided by their number. The basis of selection of the mean is the total and the number of given data points. It gives more weight to extremely high or extremely low values. It may not be present in the actual data.
The arithmetic mean is the most widely used measure of location. It requires the interval scale. Its major
characteristics are:
• It is unique: a set of data has only one mean.
• All the values are included in computing the mean.
• It is affected by unusually large or small values.
• The sum of the deviations of the values from the mean is zero.
Median
The median is that value of the variable which divides the group into two equal parts, one part comprising all values greater than the median and the other all values less than it. The basis of selection of the median is position.
• It is not affected by extremely large or small values and is therefore a valuable measure of central
tendency when such values occur.
• It can be computed for an open-ended frequency distribution if the median does not lie in an open-
ended class.
Mode
The value of the variable which occurs most frequently in a distribution is called the mode. The value having the maximum frequency is the mode. The basis of selection of the mode is frequency.
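The three averages can be computed in R as shown below for a small made-up set of marks; since base R has no built-in function for the statistical mode, a small helper (named stat_mode here) is defined:

# Mean, median and mode of a small made-up set of marks
marks <- c(45, 52, 52, 60, 61, 52, 70, 75)
mean(marks)      # arithmetic mean
median(marks)    # median (positional average)
# Base R has no statistical mode function, so define a small helper
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(marks) # most frequent value: 52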
An average or central value alone cannot describe a distribution adequately; that is, the average does not enable us to draw a full picture of a set of observations. Two sets of observations may have the same average, but in one set the observations may scatter widely around this average while in the other all the observations may be close to it. Thus, a measure of the scatter of observations around their average is necessary to get a better description of the data. The extent or degree to which data spread around an average is called dispersion or variation. Measures of dispersion may be absolute or relative. An absolute measure of dispersion is expressed in the units of the given observations; such measures are useful for comparing the dispersion of two or more distributions in which the units of measurement are the same. On the other hand, relative measures of dispersion are pure, unit-less numbers useful for comparing the variability of two or more distributions in which the units of measurement are different. We can also say that a measure of dispersion is designed to state the extent to which individual observations vary from their average; the measurement of the scatter of the mass of figures in a series about an average is called a measure of dispersion.
Measure of Dispersion
• A measure of location, such as the mean or the median, only describes the center of the data. It is
valuable from that standpoint, but it does not tell us anything about the spread of the data. For example,
if your nature guide told you that the river ahead averaged 3 feet in depth, would you want to wade
across on foot without additional information? Probably not. You would want to know something about
the variation in the depth.
• A second reason for studying the dispersion in a set of data is to compare the spread in two or more
distributions
Type of Dispersion
Absolute measure
A measure of dispersion expressed in terms of the units of the observations is called an absolute measure. Thus, absolute measures of dispersion are useful for comparing variation in two or more distributions where the units of measurement are the same.
(a) Range (b) Quartile Deviation (c) Mean Deviation (d) Standard Deviation
Relative measure
Relative measures are useful for comparing the variability of two or more distributions where the units of measurement may be different. They are expressed as a ratio or percentage, i.e., as a coefficient of the absolute measure of dispersion, and are independent of units. Such measures of dispersion are known as relative measures of dispersion, for example:
(a) Coefficient of Range (b) Coefficient of Quartile Deviation (c) Coefficient of Mean Deviation (d) Coefficient of Variation
Variance
The term variance refers to a statistical measurement of the spread between numbers in a data set. More
specifically, variance measures how far each number in the set is from the mean (average), and thus from
every other number in the set. Variance is often denoted by the symbol σ².
Variance is a measure of how data points differ from the mean. In layman's terms, variance is a measure of how far a set of data (numbers) is spread out from its mean (average) value. Variance can also be described as the expected squared deviation of a value from the mean.
Standard Deviation
A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low, or small,
standard deviation indicates data are clustered tightly around the mean, and high, or large, standard deviation
indicates data are more spread out.
Standard deviation is the spread of a group of numbers from the mean. The variance measures the average
degree to which each point differs from the mean. While standard deviation is the square root of the variance,
variance is the average of the squared difference of each data point from the mean.
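In R these two measures can be computed with var() and sd(); note that both use the sample formulas (division by n - 1):

# Variance and standard deviation of the same made-up marks
marks <- c(45, 52, 52, 60, 61, 52, 70, 75)
var(marks)                                 # sample variance
sd(marks)                                  # sample standard deviation
all.equal(sd(marks), sqrt(var(marks)))     # the sd is the square root of the variance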
Linear Regression
Regression analysis is concerned with the formulation and determination of an algebraic expression for the relationship between two variables. We use the general form of regression lines for this algebraic expression.
These regression lines or the exact algebraic forms of relationship are used for predicting the value of one
variable from that of the other. Here, the variable whose value is to be predicted is called dependent variable
and the variable used for prediction is called the independent variable. Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.
Lines of Regression
In a bivariate study, we have two lines of regression
Regression of Y on X
The line of regression of Y on X is used to predict or estimate the value of Y for a given value of the variable X. Thus, Y is the dependent variable and X is the independent variable in this case.
Regression of X on Y
The line of regression of X on Y is used to estimate or predict the value of X for a given value of the variable Y. In this case, X is the dependent variable and Y is the independent variable.
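A minimal example of the line of regression of Y on X in R uses lm(); here mpg (Y) is predicted from car weight wt (X) in the built-in mtcars data, chosen purely for illustration:

# Line of regression of Y on X: mpg regressed on weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                                    # intercept and slope of the line
predict(fit, newdata = data.frame(wt = 3))   # predicted Y for a given value of X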
Multiple Regression
Multiple regression is a statistical technique that can be used to analyse the relationship between a single
dependent variable and several independent variables. The objective of multiple regression analysis is to use
the independent variables whose values are known to predict the value of the single dependent variable.
There are several types of multiple regression analyses (e.g. standard, hierarchical, setwise, stepwise), only
two of which will be presented here (standard and stepwise). Which type of analysis is conducted depends on
the question of interest to the researcher.
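A standard (simultaneous-entry) multiple regression can be fitted in R with lm() using several predictors, and a stepwise variant can be obtained with step(); the mtcars variables below are used only as an illustration:

# Standard multiple regression: one dependent variable, several predictors
multi <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(multi)            # coefficients, R-squared and significance tests
# Stepwise variant: step() adds/drops predictors based on AIC
step(multi, trace = 0)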
Graph Analytics
Graph analytics is an emerging form of data analysis that helps businesses understand complex relationships
between linked entity data in a network or graph.
Graph analytics is a category of tools used to apply algorithms that will help the analyst understand the
relationship between graph database entries.
The structure of a graph is made up of nodes (also known as vertices) and edges. Nodes denote points in the
graph data. For example, accounts, customers, devices, groups of people, organizations, products or locations
may all be represented as a node. Edges symbolize the relationships, or lines of communication, between
nodes. Every edge can have a direction, either one-way or bidirectional, and a weight, to depict the strength
of the relationship.
Once the graph database is constructed, analytics can be applied. The algorithms can be used to identify values
or uncover insights within the data such as the average path length between nodes, nodes that might
be outliers and nodes with dominant activity. It can also be used to arrange the data in new ways such as
partitioning information into sections for individual analysis or searching for nodes that meet specific criteria.
Some common tools used to create graph analytics include Apache Spark GraphX, IBM Graph, Gradoop,
Google Charts, Cytoscape and Gephi.
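A toy graph-analytics sketch in R is given below; it assumes the igraph package is installed (igraph is not one of the tools listed above, but serves the same purpose), and the five edges are made up:

# Toy graph built from a made-up edge list (requires the igraph package)
library(igraph)
edges <- data.frame(from = c("A", "A", "B", "C", "D"),
                    to   = c("B", "C", "C", "D", "E"))
g <- graph_from_data_frame(edges, directed = FALSE)   # nodes and edges
degree(g)          # activity of each node (number of connections)
mean_distance(g)   # average path length between nodes
components(g)$no   # number of connected sections of the graph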
Simulation
A simulation is a model that mimics the operation of an existing or proposed system, providing evidence for
decision-making by being able to test different scenarios or process changes. This can be coupled with virtual
reality technologies for a more immersive experience.
A simulation is a representation of something, not the real thing, like the simulation of life in New York City,
seen in movies that were shot on Hollywood sound stages and on the streets of Toronto. A simulation is
something that represents something else — it isn't the real thing.
Some examples of computer simulation modelling familiar to most of us include: weather forecasting, flight
simulators used for training pilots, and car crash modelling.
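As a very small computer simulation example in R, the snippet below runs a Monte Carlo simulation of daily demand to estimate how often stock runs out; all numbers are made up:

# Monte Carlo simulation: how often does daily demand exceed the stock on hand?
set.seed(1)
n_days <- 10000
stock  <- 120
demand <- rpois(n_days, lambda = 100)   # simulated daily demand
mean(demand > stock)                    # estimated probability of a stock-out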
Optimisation
The process of identifying and implementing new methods to make a company more productive and cost-effective is known as business optimisation. An optimisation model typically involves the following elements (a small R sketch follows this list):
• Decision variables
• An objective function to be optimised
• Constraints that must be satisfied (some of which may be binding, or active)
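These elements can be illustrated with a toy product-mix problem in R; the sketch assumes the lpSolve package is installed and all profit and resource figures are made up:

# Toy product-mix optimisation with lpSolve (all figures are made up)
library(lpSolve)
profit      <- c(40, 30)                   # objective function: profit per unit of A and B
constraints <- rbind(c(2, 1),              # labour hours needed per unit
                     c(1, 3))              # material units needed per unit
limits      <- c(100, 90)                  # available labour hours and material
sol <- lp(direction = "max", objective.in = profit,
          const.mat = constraints, const.dir = c("<=", "<="),
          const.rhs = limits)
sol$solution   # optimal number of units of A and B (decision variables)
sol$objval     # maximum achievable profit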