
Data Analytics and Visualization
BCDS 501
UNIT 1: Introduction to Data Analytics
Sources of data

The primary data are those which are collected afresh and for the first time, and thus happen to be original in
character.

The secondary data, on the other hand, are those which have already been collected by someone else and which
have already been passed through the statistical process.

The researcher has to decide which sort of data to use (and thus collect) for the study and accordingly select one or the other method of data collection. The methods of collecting primary and secondary data differ, since primary data are to be originally collected, while in the case of secondary data the work is merely that of compilation. We describe the different methods of data collection, with the pros and cons of each method.
Collection of primary data

(i) observation method,

(ii) interview method,

(iii) through questionnaires,

(iv) through schedules, and

(v) other methods which include (a) warranty cards; (b) distributor audits; (c) pantry audits; (d) consumer panels; (e) using
mechanical devices; (f) through projective techniques; (g) depth interviews, and (h) content analysis. We briefly take up
each method separately.
Observation method

The observation method is the most commonly used method, especially in studies relating to
behavioural sciences. In a way we all observe things around us, but this sort of observation is not
scientific observation. Observation becomes a scientific tool and the method of data collection for the
researcher when it serves a formulated research purpose, is systematically planned and recorded, and
is subjected to checks and controls on validity and reliability. Under the observation method, the
information is sought by way of the investigator's own direct observation without asking the
respondent. For instance, in a study relating to consumer behaviour, the investigator, instead of asking
the brand of wrist watch used by the respondent, may himself look at the watch.

The main advantage of this method is that subjective bias is eliminated, if observation is done
accurately. Secondly, the information obtained under this method relates to what is currently
happening; it is not complicated by either the past behaviour or future intentions or attitudes. Thirdly,
this method is independent of respondents’ willingness to respond and as such is relatively less
demanding of active cooperation on the part of respondents as happens to be the case in the interview
or the questionnaire method. This method is particularly suitable in studies which deal with subjects
(i.e., respondents) who are not capable of giving verbal reports of their feelings for one reason or another.
DEMERITS
● Firstly, it is an expensive method.
● Secondly, the information provided by this method is very
limited.
● Thirdly, sometimes unforeseen factors may interfere with the
observational task. At times, the fact that some people are
rarely accessible to direct observation creates an obstacle to
collecting data effectively by this method.
Observation method
MERITS
(i) The researcher is enabled to record the natural behaviour of the group.
(ii) The researcher can even gather information which could not easily be obtained if he observes in a disinterested fashion.
(iii) The researcher can even verify the truth of statements made by informants in the context of a questionnaire or a schedule.
But there are also certain demerits of this type of observation, viz., the observer may lose objectivity to the extent he participates emotionally; the problem of observation-control is not solved; and it may narrow down the researcher's range of experience.
Interview method
The interview method of collecting data involves presentation of oral-verbal stimuli and reply in terms of oral-verbal responses. This method can be used through personal interviews and, if possible, through telephone interviews.

(a) Personal interviews: The personal interview method requires a person known as the interviewer asking questions generally in a face-to-face contact to the other person or persons. (At times the interviewee may also ask certain questions and the interviewer responds to these, but usually the interviewer initiates the interview and collects the information.) This sort of interview may be in the form of direct personal investigation or it may be indirect oral investigation.
Interview method
MERITS
1. More information and that too in greater depth can be obtained.
2. Interviewer by his own skill can overcome the resistance, if any, of the respondents; the interview
method can be made to yield an almost perfect sample of the general population.
3. There is greater flexibility under this method as the opportunity to restructure questions is always
there, especially in the case of unstructured interviews.
4. Observation method can as well be applied to recording verbal answers to various questions.
5. Personal information can as well be obtained easily under this method.
6. Samples can be controlled more effectively as there arises no difficulty of the missing returns; non-
response generally remains very low.
7. The interviewer can usually control which person(s) will answer the questions. This is not possible
in mailed questionnaire approach. If so desired, group discussions may also be held.
Interview method
DEMERITS
● It is a very expensive method, especially when a large and widely spread geographical sample is taken.
● There remains the possibility of the bias of interviewer as well as that of the respondent; there also remains
the headache of supervision and control of interviewers.
● Certain types of respondents such as important officials or executives or people in high income groups may
not be easily approachable under this method and to that extent the data may prove inadequate.
● This method is relatively more time-consuming, especially when the sample is large and recalls upon the
respondents are necessary.
● The presence of the interviewer on the spot may over-stimulate the respondent, sometimes even to the extent
that he may give imaginary information just to make the interview interesting.
● Under the interview method the organisation required for selecting, training and supervising the field-staff is
more complex with formidable problems.
● Interviewing at times may also introduce systematic errors.
● Effective interview presupposes proper rapport with respondents that would facilitate free and frank
responses. This is often a very difficult requirement.
Interview method

(b) Telephone interviews: This method of collecting information consists of contacting respondents on the telephone itself. It is not a very widely used method, but it plays an important part in industrial surveys, particularly in developed regions.
The chief merits of such a system are:
1. It is more flexible in comparison to mailing method.
2. It is faster than other methods i.e., a quick way of obtaining information.
3. It is cheaper than personal interviewing method; here the cost per response is
relatively low.
4. Recall is easy; callbacks are simple and economical.
Interview method

5. There is a higher rate of response than what we have in mailing method; the
non-response is generally very low.
6. Replies can be recorded without causing embarrassment to respondents.
7. Interviewer can explain requirements more easily.
8. At times, access can be gained to respondents who otherwise cannot be
contacted for one reason or the other.
9. No field staff is required.
10. Representative and wider distribution of sample is possible.
Interview method

But this system of collecting information is not free from demerits. Some of these may be
highlighted.
1. Little time is given to respondents for considered answers; interview period is not likely
to exceed five minutes in most cases.
2. Surveys are restricted to respondents who have telephone facilities.
3. Extensive geographical coverage may get restricted by cost considerations.
4. It is not suitable for intensive surveys where comprehensive answers are required to
various questions.
5. Possibility of the bias of the interviewer is relatively more.
6. Questions have to be short and to the point; probes are difficult to handle.
COLLECTION OF DATA THROUGH
QUESTIONNAIRES

This method of data collection is quite popular, particularly in the case of big enquiries. It is being adopted by private individuals, research workers, private and public organisations and even by governments. In this method a questionnaire is sent (usually by post) to the persons concerned with a request to answer the questions and return the questionnaire. A questionnaire consists of a number of questions printed or typed in a definite order on a form or set of forms. The questionnaire is mailed to respondents who are expected to read and understand the questions and write down the reply in the space meant for the purpose in the questionnaire itself. The respondents have to answer the questions on their own.
COLLECTION OF DATA THROUGH
QUESTIONNAIRES
MERITS
● There is low cost even when the universe is large
and is widely spread geographically.
● It is free from the bias of the interviewer; answers are
in respondents’ own words.
● Respondents have adequate time to give well
thought out answers.
● Respondents, who are not easily approachable, can
also be reached conveniently.
● Large samples can be made use of and thus the
results can be made more dependable and reliable.
COLLECTION OF DATA THROUGH
QUESTIONNAIRES
DEMERITS
● Low rate of return of the duly filled in questionnaires; bias due to non-response is
often indeterminate.
● It can be used only when respondents are educated and cooperating.
● The control over questionnaire may be lost once it is sent.
● There is inbuilt inflexibility because of the difficulty of amending the approach
once questionnaires have been despatched.
● There is also the possibility of ambiguous replies or omission of replies
altogether to certain questions; interpretation of omissions is difficult.
● It is difficult to know whether willing respondents are truly representative.
● This method is likely to be the slowest of all.
COLLECTION OF DATA THROUGH
SCHEDULES
● This method of data collection is very much like the collection of data through
questionnaires; the main difference lies in the fact that schedules (proforma
containing a set of questions) are filled in by enumerators who are
specially appointed for the purpose. These enumerators, along with the schedules, go
to respondents, put to them the questions from the proforma in the order the
questions are listed and record the replies in the space meant for the same in the
proforma. In certain situations, schedules may be handed over to respondents
and enumerators may help them in recording their answers to various questions in
the said schedules. Enumerators explain the aims and objects of the investigation
and also remove the difficulties which any respondent may feel in understanding
the implications of a particular question or the definition or concept of difficult
terms.
COLLECTION OF DATA THROUGH
SCHEDULES
● This method of data collection is very useful in extensive
enquiries and can lead to fairly reliable results. It is, however,
very expensive and is usually adopted in investigations
conducted by governmental agencies or by some big
organisations. Population census all over the world is
conducted through this method.
Collection of secondary data
Secondary data means data that are already available i.e., they refer to the data which have already
been collected and analysed by someone else. When the researcher utilises secondary data, then he has
to look into various sources from where he can obtain them. In this case he is certainly not confronted
with the problems that are usually associated with the collection of original data. Secondary data may
either be published data or unpublished data. Usually published data are available in:
(a) various publications of the central, state and local governments;
(b) various publications of foreign governments or of international bodies and their subsidiary
organisations;
(c) technical and trade journals;
(d) books, magazines and newspapers;
(e) reports and publications of various associations connected with business and industry, banks, stock
exchanges, etc.;
(f) reports prepared by research scholars, universities, economists, etc. in different fields; and
(g) public records and statistics, historical documents, and other sources of published information. The
sources of unpublished data are many; they may be found in diaries, letters, unpublished biographies
and autobiographies and also may be available with scholars and research workers, trade associations,
labour bureaus and other public/ private individuals and organisations.
Classification of data
● Big data can come in multiple forms, including structured and non-structured data such as
financial data, text files, multimedia files, and genetic mappings. Contrary to much of the
traditional data analysis performed by organizations, most of the Big Data is unstructured
or semi-structured in nature, which requires different techniques and tools to process and
analyze. Distributed computing environments and massively parallel processing (MPP)
architectures that enable parallelized data ingest and analysis are the preferred approach to
process such complex data. With this in mind, this section takes a closer look at data
structures. Figure 1-3 shows four types of data structures, with 80-90% of future data
growth coming from nonstructured data types. Though different, the four are commonly
mixed. For example, a classic Relational Database Management System (RDBMS) may
store call logs for a software support call center. The RDBMS may store characteristics of
the support calls as typical structured data, with attributes such as time stamps, machine
type, problem type, and operating system. In addition, the system will likely have
unstructured, quasi- or semi-structured data, such as free-form call log information taken
from an e-mail ticket of the problem, customer chat history, or transcript of a phone call
describing the technical problem and the solution or audio file of the phone call
conversation. Many insights could be extracted from the unstructured, quasi- or semi-
structured data in the call center data.
Classification of data
Although analyzing structured data tends to be the most familiar technique, a different technique is required to meet the challenges of analyzing semi-structured data (shown as XML), quasi-structured data (shown as a clickstream), and unstructured data. Here are examples of how each of the four main types of data structures may look.
● Structured data: Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
Classification of data
Semi-structured data: Textual data files with a discernible pattern that
enables parsing (such as Extensible Markup Language [XML] data files
that are self-describing and defined by an XML schema).

Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web clickstream data
that may contain inconsistencies in data values and formats).

Unstructured data: Data that has no inherent structure, which may include
text documents, PDFs, images, and video.
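To make the distinction concrete, here is a small illustrative Python sketch of how each of the four structure types might be read; the sample values and field names are invented for this example and are not from any particular system.

```python
# Illustrative only: tiny samples of the four data-structure types.
import csv
import io
import xml.etree.ElementTree as ET

# Structured data: fixed columns, like a CSV export of an RDBMS table.
csv_text = "time_stamp,problem_type\n2024-05-01 10:15,boot failure\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["time_stamp"], row["problem_type"])

# Semi-structured data: self-describing XML that a parser can walk.
ticket = ET.fromstring("<ticket><os>Linux</os><issue>boot failure</issue></ticket>")
print(ticket.find("os").text)

# Quasi-structured data: a clickstream-style log line with an erratic but parseable format.
line = "2024-05-01T10:15:02 GET /pricing?plan=pro 200"
timestamp, method, path, status = line.split()

# Unstructured data: free text (chat transcripts, call notes) with no fixed schema.
transcript = "Customer reports the machine fails to boot after the update..."
word_count = len(transcript.split())
```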
Characteristics of data
The degree of data quality is expressed in a number of characteristics or dimensions. These can be objective (number of
errors or missing values) or subjective (fitness for purpose). Because the goal defines the relevance and required quality
level of data, naming generic characteristics is difficult; features can also overlap. The most commonly used
characteristics of data quality are: accuracy, reliability, completeness, consistency, timeliness, and uniqueness.
● Relevance: This is a more subjective and comprehensive assessment of data quality. Data is useless if it is not
relevant to the intended purpose. That's why it's crucial to define goals so you know what kind of data you need and what level of quality you need to collect.
● Completeness: The extent to which a dataset contains all the values necessary to complete the task at hand.
Identifying an incomplete dataset is different from looking for empty cells. The lack of first names is not a problem for
an e-mail campaign, but it is if you want to sort this dataset by name. Another example is that having a complete
customer base makes it possible to personalise communication with customers.
The percentage of missing relevant values in a dataset can be calculated vertically (attribute level) or horizontally
(record level).
● Reliability: The degree to which data is true and factual.
● Validity: Data is considered valid if it has the correct format, type, and range. This may differ based on the country,
sector, or standards used. Here are several examples:
● Data type: numeric, boolean, labels.
● Range: values must be within a certain interval; for example, a birth year of 201 is invalid because it is outside the date range.
● Patterns: When dates do not meet established standards, they are considered invalid, for example, MM-DD-YYYYY for a date of birth.
● The strict requirement that a telephone number must contain only digits makes validation easier and prevents errors, so for Dutch
telephone numbers: 13 digits, 0031 instead of +31, and no spaces or hyphens.
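As a rough illustration of such validity rules, the sketch below encodes the birth-year range, date-pattern, and Dutch telephone checks described above as simple Python functions; the helper names and the exact year bounds are assumptions made for the example.

```python
# Simple validity checks sketched from the rules above (illustrative only).
import re
from datetime import date

def valid_birth_year(year: int) -> bool:
    # A birth year must fall inside a plausible range; 201 would be rejected.
    return 1900 <= year <= date.today().year

def valid_dutch_phone(number: str) -> bool:
    # 13 digits, starting with 0031 instead of +31, no spaces or hyphens.
    return bool(re.fullmatch(r"0031\d{9}", number))

def valid_date_pattern(text: str) -> bool:
    # Dates must follow an established MM-DD-YYYY pattern to be valid.
    return bool(re.fullmatch(r"\d{2}-\d{2}-\d{4}", text))

print(valid_birth_year(201))               # False: outside the allowed range
print(valid_dutch_phone("0031612345678"))  # True: 13 digits, 0031 prefix
print(valid_date_pattern("05-01-20244"))   # False: extra digit in the year
```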
Characteristics of data
● Identification numbers instead of names that can be spelled in many ways.
● Accuracy: How effectively does the data describe the real-world conditions it is trying to describe? This is one of the
most important properties of high-quality data. Accuracy can be checked by comparing data with a reliable source.
● Identifiability: The extent to which data records are uniquely identifiable and the dataset is free of duplicate records.
● Consistency: Similar data recorded in different sources should have the same meaning, structure, and format. This
determines reliability. The chance of inconsistencies increases as the number of sources increases. Data in one
location can be updated, but not in another.
For example, data must all have the same structure (+31 versus 0031 for telephone numbers, or 10:00 PM versus 22:00 for times) or the same unit (kg versus gramme).
● Currency (timeliness): How current is the data? As time passes, the data becomes less useful and less accurate. More current
data is more likely to reflect contemporary reality.
● Metadata: Data about data; the quality of the description of the dataset (definitions, abbreviations, units, calculation
methods, structure, sources).
● Open data: Open data facilitates transparency, accountability, and public participation, for example, by quickly
identifying data inaccuracies. Important obstacles are the commercial value of data and the sensitivity of data.
● Accessibility: How easily and quickly is the required data available? A user who needs isolated data must overcome
numerous difficulties to obtain this data. This is not only a waste of time but also increases the chance that data will
be out of date when it becomes available. Sensitive data is often not made public or only shared under strict
restrictions.
Introduction to big data platform

● Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
Big Data examples
● Tracking consumer behavior and shopping habits to deliver hyper-personalized retail product recommendations tailored to individual customers
● Monitoring payment patterns and analyzing them against historical customer activity to detect fraud in real time
● Using AI-powered technologies like natural language processing to analyze unstructured medical data (such as research reports, clinical notes, and lab results) to gain new insights for improved treatment development and enhanced patient care
● Combining data and information from every stage of an order’s shipment journey with hyperlocal traffic insights to help fleet operators optimize last-mile delivery
● Analyzing public datasets of satellite imagery and geospatial datasets to visualize, monitor, measure, and predict the social and environmental impacts of supply chain operations
● Using image data from cameras and sensors, as well as GPS data, to detect potholes and improve road maintenance in cities
V’s of Big data
Big data definitions may vary slightly, but it will always be described in
terms of volume, velocity, and variety. These big data characteristics are
often referred to as the “3 Vs of big data” and were first defined by Gartner
in 2001.

● Volume: As its name suggests, the most common characteristic associated with big data is its high volume. This describes the enormous amount of data that is available for collection and produced from a variety of sources and devices on a continuous basis.
● Velocity: Big data velocity refers to the speed at which data is generated. Today, data is often produced in real time or near real time, and therefore, it must also be processed, accessed, and analyzed at the same rate to have any meaningful impact.
V’s of Big data
Variety
● Data is heterogeneous, meaning it can come from many different sources and can be structured,
unstructured, or semi-structured. More traditional structured data (such as data in spreadsheets or relational
databases) is now supplemented by unstructured text, images, audio, video files, or semi-structured formats
like sensor data that can’t be organized in a fixed data schema.
In addition to these three original Vs, three others are often mentioned in relation to harnessing the power of
big data: veracity, variability, and value.
● Veracity: Big data can be messy, noisy, and error-prone, which makes it difficult to control the quality and
accuracy of the data. Large datasets can be unwieldy and confusing, while smaller datasets could present an
incomplete picture. The higher the veracity of the data, the more trustworthy it is.
● Variability: The meaning of collected data is constantly changing, which can lead to inconsistency over time.
These shifts include not only changes in context and interpretation but also data collection methods based
on the information that companies want to capture and analyze.
● Value: It’s essential to determine the business value of the data you collect. Big data must contain the right
data and then be effectively analyzed in order to yield insights that can help drive decision-making.
Big data benefits
Improved decision-making
● Big data is the key element to becoming a data-driven organization. When
you can manage and analyze your big data, you can discover patterns and
unlock insights that improve and drive better operational and strategic
decisions.
Increased agility and innovation
● Big data allows you to collect and process real-time data points and analyze
them to adapt quickly and gain a competitive advantage. These insights can
guide and accelerate the planning, production, and launch of new products,
features, and updates.
Better customer experiences
● Combining and analyzing structured data sources together with unstructured ones provides you with more useful insights for consumer understanding, personalization, and ways to optimize the experience to better meet customer needs.
Big data benefits
Continuous intelligence
● Big data allows you to integrate automated, real-time data streaming
with advanced data analytics to continuously collect data, find new
insights, and discover new opportunities for growth and value.
More efficient operations
● Using big data analytics tools and capabilities allows you to process
data faster and generate insights that can help you determine areas
where you can reduce costs, save time, and increase your overall
efficiency.
Improved risk management
● Analyzing vast amounts of data helps companies evaluate risk better
—making it easier to identify and monitor all potential threats and
report insights that lead to more robust control and mitigation
strategies.
Challenges of implementing big data analytics

● Lack of data talent and skills. Data scientists, data analysts, and data engineers are in short supply—and are some of the most highly sought after (and highly paid) professionals in the IT industry. Lack of big data skills and experience with advanced data tools is one of the primary barriers to realizing value from big data environments.
● Speed of data growth. Big data, by nature, is always rapidly changing and increasing. Without a solid infrastructure in place that can handle your processing, storage, network, and security needs, it can become extremely difficult to manage.
● Problems with data quality. Data quality directly impacts the quality of decision-making, data analytics, and planning strategies. Raw data is messy and can be difficult to curate. Having big data doesn't guarantee results unless the data is accurate, relevant, and properly organized for analysis. This can slow down reporting, but if not addressed, you can end up with misleading results and worthless insights.
● Compliance violations. Big data contains a lot of sensitive data and information, making it a tricky task to continuously ensure data processing and storage meet data privacy and regulatory requirements, such as data localization and data residency laws.
● Integration complexity. Most companies work with data siloed across various systems and applications across the organization. Integrating disparate data sources and making data accessible for business users is complex, but vital, if you hope to realize any value from your big data.
● Security concerns. Big data contains valuable business and customer information, making big data stores high-value targets for attackers. Since these datasets are varied and complex, it can be harder to implement comprehensive strategies and policies to protect them.
Big Data platform
● A big data platform is an integrated computing solution that combines numerous software systems, tools, and hardware for big data management. It is a one-stop architecture that solves all the data needs of a business regardless of the volume and size of the data at hand. Due to their efficiency in data management, enterprises are increasingly adopting big data platforms to gather tons of data and convert them into structured, actionable business insights.
Characteristics of big data platform

● Ability to accommodate new applications and tools depending on the evolving business
needs
● Support several data formats
● Ability to accommodate large volumes of streaming or at-rest data
● Have a wide variety of conversion tools to transform data to different preferred formats
● Capacity to accommodate data at any speed
● Provide the tools for scouring through massive data sets
● Support linear scaling
● The ability for quick deployment
● Have the tools for data analysis and reporting requirements
Big data platforms
a. Apache Hadoop
● Apache Hadoop is one of the industry's most widely used big data platforms. It is an
open-source framework that enables distributed processing for massive datasets
throughout clusters. Hadoop provides a scalable and cost-effective solution for
storing, processing, and analyzing massive amounts of structured and unstructured
data.
● One of the key features of Hadoop is its distributed file system, known as Hadoop
Distributed File System (HDFS). HDFS enables data to be stored across multiple
machines, providing fault tolerance and high availability. This feature allows
businesses to store and process data at a previously unattainable scale. Hadoop
also includes a powerful processing engine called MapReduce, which allows for
parallel data processing across the cluster. The prominent companies that use
Apache Hadoop are:
● Yahoo
● Facebook
● Twitter
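To give a feel for the MapReduce model that Hadoop implements at cluster scale, here is a minimal Python sketch in the style of a Hadoop Streaming word count. In a real cluster the mapper and reducer would be two separate scripts submitted with the hadoop-streaming jar against files in HDFS; the local sort below only simulates Hadoop's shuffle step, and the script is illustrative rather than a production job.

```python
#!/usr/bin/env python3
# Word count in the MapReduce style used by Hadoop Streaming (illustrative).
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every token in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts per word; input must be sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally simulate the shuffle/sort that Hadoop performs between phases.
    mapped = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```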
Big data platforms
b. Apache Spark
● Apache Spark is a unified analytics engine for batch processing, streaming data,
machine learning, and graph processing. It is one of the most popular big data
platforms used by companies. One of the key benefits that Apache Spark offers is
speed. It is designed to perform data processing tasks in-memory and achieve
significantly faster processing times than traditional disk-based systems.
● Spark also supports various programming languages, including Java, Scala, Python,
and R, making it accessible to a wide range of developers. Spark offers a rich set of libraries and tools, such as Spark SQL for querying structured data, MLlib for machine learning, and GraphX for graph processing. Spark integrates well with other
big data technologies, such as Hadoop, allowing companies to leverage their
existing infrastructure. The prominent companies that use Apache Spark include:
● Netflix
● Uber
● Airbnb
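A minimal PySpark sketch of the kind of in-memory DataFrame processing described above; it assumes pyspark is installed and a local Spark runtime is available, and the file path and column names are illustrative only.

```python
# Illustrative PySpark job: count support calls per problem type.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("support-call-analysis").getOrCreate()

# Read structured data; Spark keeps intermediate results in memory.
calls = spark.read.csv("support_calls.csv", header=True, inferSchema=True)

# Spark SQL-style aggregation: number of calls per problem type.
(calls.groupBy("problem_type")
      .agg(F.count("*").alias("num_calls"))
      .orderBy(F.desc("num_calls"))
      .show())

spark.stop()
```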
Big data platforms
c. Google Cloud BigQuery
● Google Cloud BigQuery is a top-rated big data platform that provides a fully managed and
serverless data warehouse solution. It offers a robust and scalable infrastructure for
storing, querying, and analyzing massive datasets. BigQuery is designed to handle
petabytes of data and allows users to run SQL queries on large datasets with impressive
speed and efficiency.
● BigQuery supports multiple data formats and integrates seamlessly with other Google
Cloud services, such as Google Cloud Storage and Google Data Studio. BigQuery's unique
architecture enables automatic scaling, ensuring users can process data quickly without
worrying about infrastructure management. BigQuery offers a standard SQL interface for
querying data, built-in machine learning algorithms for predictive analytics, and geospatial
analysis capabilities. The prominent companies that use Google Cloud BigQuery are:
● Spotify
● Walmart
● The New York Times
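A small sketch of querying BigQuery from Python with the google-cloud-bigquery client library; the project, dataset, and table names are placeholders, and credentials are assumed to be configured in the environment.

```python
# Illustrative BigQuery query using the google-cloud-bigquery client
# (pip install google-cloud-bigquery); table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment

query = """
    SELECT problem_type, COUNT(*) AS num_calls
    FROM `my_project.support.calls`
    GROUP BY problem_type
    ORDER BY num_calls DESC
"""

# BigQuery runs the SQL serverlessly; we simply iterate over the results.
for row in client.query(query).result():
    print(row.problem_type, row.num_calls)
```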
Big data platforms
d. Amazon EMR
● Amazon EMR is a widely used big data platform from Amazon Web Services (AWS).
It offers a scalable and cost-effective solution for processing and analyzing large
datasets using popular open-source frameworks such as Apache Hadoop, Apache
Spark, and Apache Hive. EMR allows users to quickly provision and manage clusters
of virtual servers, known as instances, to process data in parallel.
● EMR integrates seamlessly with other AWS services, such as Amazon S3 for data
storage and Amazon Redshift for data warehousing, enabling a comprehensive big
data ecosystem. Additionally, EMR supports various data processing frameworks
and tools, making it suitable for a wide range of use cases, including data
transformation, machine learning, log analysis, and real-time analytics. The
prominent companies that use Amazon EMR are:
● Expedia
● Lyft
● Pfizer
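A hypothetical sketch of launching a small EMR cluster with Spark and Hive from Python using boto3; the release label, instance types, IAM roles, and S3 log path are placeholders that would need to match an actual AWS account setup.

```python
# Illustrative only: provision a small EMR cluster with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-sandbox",
    ReleaseLabel="emr-6.15.0",            # assumed release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    LogUri="s3://my-bucket/emr-logs/",     # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```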
Big data platforms
e. Microsoft Azure HDInsight
● Microsoft Azure HDInsight is a leading big data platform offered by
Microsoft Azure. It provides a fully managed cloud service for processing
and analyzing large datasets using popular open-source frameworks such
as Apache Hadoop, Apache Spark, Apache Hive, and Apache HBase.
HDInsight offers a scalable and reliable infrastructure that allows users to
easily deploy and manage clusters.
● HDInsight integrates seamlessly with other Azure services, such as Azure
Data Lake Storage and Azure Synapse Analytics, offering a comprehensive
ecosystem of Microsoft Azure services. HDInsight supports various
programming languages, including Java, Python, and R, making it accessible
to a wide range of users. The prominent companies that use Microsoft
Azure HDInsight are:
● Starbucks
What is Data Analytics ?
● Data analytics is a multidisciplinary field that employs a wide
range of analysis techniques, including math, statistics, and
computer science, to draw insights from data sets. Data
analytics is a broad term that includes everything from simply
analyzing data to theorizing ways of collecting data and
creating the frameworks needed to store it.
OR
● Data analytics is the science of analyzing raw data to make
conclusions about that information.
Types of data analytics
There are four key types of data analytics: descriptive, diagnostic,
predictive, and prescriptive. Together, these four types of data
analytics can help an organization make data-driven decisions. At
a glance, each of them tells us the following:
● Descriptive analytics tell us what happened.
● Diagnostic analytics tell us why something happened.
● Predictive analytics tell us what will likely happen in the future.
● Prescriptive analytics tell us how to act.
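As a toy illustration of the difference between descriptive and predictive analytics, the sketch below summarizes a made-up monthly sales series and extrapolates a simple linear trend; the numbers are invented for the example and the trend model is deliberately simplistic.

```python
# Toy contrast between descriptive and predictive analytics.
import numpy as np
import pandas as pd

sales = pd.Series([120, 135, 150, 160, 172, 185],
                  index=pd.period_range("2024-01", periods=6, freq="M"))

# Descriptive analytics: what happened?
print(sales.describe())          # mean, spread, min/max of past sales

# Predictive analytics: what is likely to happen next?
# Fit a simple linear trend and extrapolate one month ahead.
x = np.arange(len(sales))
slope, intercept = np.polyfit(x, sales.values, 1)
print("Forecast for next month:", slope * len(sales) + intercept)
```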
Data analytics case study: Netflix

● Netflix collects all kinds of data from its 163 million global subscribers—
including what users watch and when, what device they use, whether they
pause a show and resume it, how they rate certain content, and exactly what
they search for when looking for something new to watch.
● With the help of data analytics, Netflix is then able to connect all of these
individual data points to create a detailed viewing profile for each user.
Based on key trends and patterns within each user’s viewing behavior, the
recommendation algorithm makes personalized (and pretty spot-on)
suggestions as to what the user might like to watch next.
● This kind of personalized service has a major impact on the user
experience; according to Netflix, over 75% of viewer activity is based on
personalized recommendations. This powerful use of data analytics also
contributes significantly to the success of the business; if you look at their
revenue and usage statistics, you’ll see that Netflix consistently dominates
the global streaming market—and that they’re growing year upon year.
Importance/need of data analytics
Data analytics initiatives can help businesses increase revenue, improve
operational efficiency, optimize marketing campaigns and bolster customer
satisfaction efforts across multiple industries. It can also help organizations
do the following:
● Personalize customer experiences. By going beyond traditional data
methods, data analytics connects insights with actions, enabling businesses
to create personalized customer experiences and develop related digital
products.
● Predict future trends. By using predictive analysis technologies, businesses
can create future-focused products and respond quickly to emerging market
trends, thereby gaining a competitive advantage over business rivals.
Depending on the application, the data that's analyzed can consist of either
historical records or new information that has been processed for real-time
analytics. In addition, it can come from a mix of internal systems and
external data sources.
Importance/need of data analytics
● Reduce operational costs. By optimizing processes and resource allocation, data
analytics can help reduce unnecessary expenses and identify cost-saving
opportunities within the organization.
● Provide risk management. Data analytics lets organizations identify and mitigate
risks by detecting anomalies, fraud and potential compliance issues.
● Improve security. Companies use data analytics methods, such as parsing, analyzing
and visualizing audit logs, to look at past security breaches and find the underlying
vulnerabilities. Data analytics can also be integrated with monitoring and alerting
systems to quickly notify security professionals in the event of a breach attempt.
● Measure performance. Data analytics provide organizations with metrics and key
performance indicators (KPIs) to track progress, monitor performance and evaluate
the success of business initiatives. This helps businesses respond promptly to
changing market conditions and other operational challenges.
Evolution of analytic scalability
Data Analytics tools
1. RapidMiner
● Primary use: Data mining
● RapidMiner is a comprehensive package for data mining and model development.
This platform allows professionals to work with data at many stages, including
preparation, visualization, and review. This can be beneficial for professionals who
have data that isn’t in raw format or that they have mined in the past.
● RapidMiner also offers an array of classification, regression, clustering, and
association rule mining algorithms. While it has some limitations in feature
engineering and selection, it compensates for its limitations with a powerful
graphical programming language.
● This software is suited for people with all types of backgrounds, and you can utilize it
across industries for various applications such as manufacturing, life sciences,
energy, and health care. Because of its ability to work with previously mined data, this
software can be particularly useful if you are a researcher or data scientist working
with historical data.
Data Analytics tools
2. Orange
● Primary use: Data mining
● Orange is a package renowned for data visualization and analysis, especially
appreciated for its user-friendly, color-coordinated interface. You can find a
comprehensive selection of color-coded widgets for functions like data input,
cleaning, visualization, regression, and clustering, which make it a good choice for
beginners or smaller projects.
● Despite offering fewer tools compared to other platforms, Orange is still an effective
data analysis tool, hosting an array of mainstream algorithms like k-nearest
neighbors, random forests, naive Bayes classification, and support vector machines.
● The platform holds particular value for certain types of professionals with its add-
ons. For example, if you work in bioinformatics and molecular biology, you can find
tools for gene ranking and enrichment analysis. You can also find tools for natural
language processing, text mining, and network analysis that may benefit you
depending on your profession.
Data Analytics tools
3. KNIME
● Primary use: Data mining
● KNIME, short for KoNstanz Information MinEr, is a free and open-source
data cleaning and analysis tool that makes data mining accessible even if
you are a beginner. Along with data cleaning and analysis software, KNIME
has specialized algorithms for areas like sentiment analysis and social
network analysis. With KNIME, you can integrate data from various sources
into a single analysis and use extensions to work with popular programming
languages like R, Python, Java, and SQL.
● If you are new to data mining, KNIME might be a great choice for you.
Resources on the KNIME platform can help new data professionals learn
about data mining by guiding them through building, deploying, and
maintaining large-scale data mining strategies. Because of this, many
companies use KNIME to help their employees gain data processing and
extraction experience.
Data Analytics tools
4. Tableau
● Primary use: Data visualization and business intelligence
● Tableau stands out as a leading data visualization software,
widely utilized in business analytics and intelligence.
● Tableau is a popular data visualization tool for its easy-to-use
interface and powerful capabilities. Its software can connect
with hundreds of different data sources and manipulate the
information in many different visualization types. It holds a
special appeal for both business users, who appreciate its
simplicity and centralized platform, and data analysts, who can
use more advanced big data tools for tasks such as clustering
and regression.
Data Analytics tools
5. Google Charts
● Primary use: Data visualization
● Google Charts is a free online tool that excels in producing a wide array of
interactive and engaging data visualizations. Its design caters to user-
friendliness, offering a comprehensive selection of pre-set chart types that
can embed into web pages or applications. The versatile nature of Google
Charts allows its integration with a multitude of web platforms, including
iPhone, iPad, and Android, extending its accessibility.
● This tool, with its high customization and user-friendly nature, makes it ideal
if you are looking to create compelling data visuals for web and mobile
platforms. It’s also a great option if you need to publish your charts, as the
integration makes it straightforward for you to publish on most web
platforms by sharing a link or embedding the link into a website’s HTML.
Data Analytics tools
6. Microsoft Excel and Power BI
● Primary use: Business intelligence
● Microsoft Excel, fundamentally a spreadsheet software, also has noteworthy data
analytics capabilities. Because of the wide enterprise-level adoption of Microsoft
products, many businesses find they already have access to it.
● You can use Excel to construct at least 20 distinct chart types using spreadsheet
data. These range from standard options such as bar charts and scatter plots to
more complex options like radar charts and treemaps. Excel also has many
streamlined options for businesses to find insights into their data and use modern
business analytics formulas.
● However, Excel does have its boundaries. If your business needs more robust data
visualization tools within the Microsoft ecosystem, Power BI is a great option.
Designed specifically for data analytics and visualization, Power BI can import data
from an array of sources and produce visualizations in various formats.
Data Analytics tools
7. Qlik
● Primary use: Business intelligence
● Qlik is a global company designed to help businesses utilize
data for decision-making and problem-solving. It provides
comprehensive, real-time data integration and analytics
solutions to turn data into valuable insights. Qlik’s tools help
businesses understand customer behavior, revamp business
processes, uncover new revenue opportunities, and manage
risk and reward effectively.
Data Analytics tools
8. Google Analytics
● Primary use: Business intelligence
● Google Analytics is a tool that helps businesses understand
how people interact with their websites and apps. To use it, you
add a special JavaScript code to your web pages. This code
collects information when someone visits your website, like
which pages they see, what device they’re using, and how they
found your site. It then sends this data to Google Analytics,
where it is organized into reports. These reports help you see
patterns, like which products are most popular or which ads are
bringing people to your site.
Data Analytics tools
9. Spotfire
● Primary use: Business intelligence
● TIBCO Spotfire is a user-friendly platform that transforms data into
actionable insights. It allows you to analyze historical and real-time
data, predict trends, and visualize results in a single, scalable
platform. Features include custom analytics apps, interactive AI and
data science tools, real-time streaming analytics, and powerful
analytics for location-based data.
● If you are a decision-maker in your organization, such as a marketing
manager or data scientist, you might benefit from Spotfire’s scalable
analytics platform when visually exploring your data.
Analytic process
Step 1: Defining objectives and questions
● The first step in the data analysis process is to define the objectives and formulate clear, specific questions that your analysis aims to answer. This step is crucial as it sets the direction for the entire process. It involves understanding the problem or situation at hand, identifying the data needed to address it, and defining the metrics or indicators to measure the outcomes.
Step 2: Data collection
● Once the objectives and questions are defined, the next step is to collect the relevant data. This can be done through various methods such as surveys, interviews, observations, or extracting from existing databases. The data collected can be quantitative (numerical) or qualitative (non-numerical), depending on the nature of the problem and the questions being asked.
Step 3: Data cleaning
● Data cleaning, also known as data cleansing, is a critical step in the data analysis process. It involves checking the data for errors and inconsistencies, and correcting or removing them. This step ensures the quality and reliability of the data, which is crucial for obtaining accurate and meaningful results from the analysis.
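A minimal pandas sketch of the data cleaning step (Step 3): removing duplicates, dropping rows with missing key values, and normalizing formats; the toy dataset and column names are invented for illustration.

```python
# Illustrative data-cleaning pass with pandas.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None],
    "amount":   ["100", "100", "250", "75"],
    "country":  ["NL", "NL", "nl", "BE"],
})

clean = (
    raw.drop_duplicates()                        # remove repeated records
       .dropna(subset=["customer"])              # drop rows missing key fields
       .assign(amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
               country=lambda d: d["country"].str.upper())  # consistent format
)
print(clean)
```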
Analytic process
Step 4: Data analysis
● Once the data is cleaned, it's time for the actual analysis. This involves applying statistical or
mathematical techniques to the data to discover patterns, relationships, or trends. There are
various tools and software available for this purpose, such as Python, R, Excel, and specialized
software like SPSS and SAS.
Step 5: Data interpretation and visualization
● After the data is analyzed, the next step is to interpret the results and visualize them in a way that
is easy to understand. This could involve creating charts, graphs, or other visual representations of
the data. Data visualization helps to make complex data more understandable and provides a
clear picture of the findings.
Step 6: Data storytelling
● The final step in the data analysis process is data storytelling. This involves presenting the findings
of the analysis in a narrative form that is engaging and easy to understand. Data storytelling is
crucial for communicating the results to non-technical audiences and for making data-driven
decisions.
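A small sketch of the analysis and visualization steps (Steps 4 and 5) using pandas and matplotlib; the dataset is invented for illustration and the chart is saved to a file only as an example.

```python
# Illustrative analysis (Step 4) and visualization (Step 5) with pandas/matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "revenue": [120, 90, 150, 80, 110, 95],
})

# Step 4: apply a simple statistical summary to find patterns.
by_region = orders.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(by_region)

# Step 5: visualize the result so the pattern is easy to communicate.
by_region["sum"].plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.savefig("revenue_by_region.png")   # or plt.show() in an interactive session
```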
Analysis vs Reporting

Purpose
● Reporting: Summarize and present data for informational purposes.
● Analytics: Unearth insights and patterns for strategic decision-making.
Benefits
● Reporting: Enables informed decision-making, tracks performance trends, and fosters transparency and accountability.
● Analytics: In addition, analytics helps you understand why things are happening and know what to do next.
Users
● Reporting: Primarily operational managers and executives.
● Analytics: Primarily data analysts, data scientists, and executives.
Data presentation
● Reporting: Focus on simplicity and clarity, using visual aids to convey information efficiently.
● Analytics: In addition, analytics may employ advanced statistical methods and models for more in-depth analysis.
Analysis vs Reporting

Data source and type
● Reporting: Typically relies on structured data from established sources.
● Analytics: May encompass a broader range, including unstructured, big data, and real-time data.
Process
● Reporting: Data collection, organization, and presentation.
● Analytics: Data collection, organization, and presentation, plus data exploration, hypothesis testing, and advanced analysis.
Tool complexity
● Reporting: Reporting tools are usually user-friendly and straightforward, making them accessible to a wide range of users without extensive technical training.
● Analytics: Self-service analytics tools are user-friendly, but advanced analysis and predictive modeling can require a higher level of technical expertise.
Data analytics applications
Business intelligence and reporting. At the application level, BI and reporting provide organizations with actionable
information about KPIs, business operations, customers and more. In the past, data queries and reports were typically
created for end users by BI developers who worked in IT. Now, more organizations use self-service BI tools that let
executives, business analysts and operational workers run their own ad hoc queries and build reports themselves.

Data mining. Advanced types of data analytics include data mining, which involves sorting through large data sets to
identify trends, patterns and relationships.

Retail. Data analytics can be used in the retail industry to forecast trends, launch new items and increase sales by
understanding customer demands and purchasing patterns.

Machine learning. ML can also be used for data analytics by running automated algorithms to churn through data sets
more quickly than data scientists can do via conventional analytical modeling.
Data analytics applications
Big data analytics. Big data analytics applies data mining, predictive analytics and ML tools to data sets that can include
a mix of structured and unstructured data as well as semi-structured data. Text mining provides a means of analyzing
documents, emails and other text-based content.

Churn forecast. Mobile network operators examine customer data to identify customers who are most likely to not come
back and help organizations retain them.

Marketing. Companies engage in customer relationship management analytics to segment customers for marketing
campaigns and equip call center workers with up-to-date information about callers.

Delivery logistics. Logistics companies such as UPS, DHL and FedEx use data analytics to improve delivery times,
optimize operations and identify the most cost-effective shipping routes and modes of transportation.

Government and public sector. Governments use data analytics for policy formation, resource distribution and for gaining
insights into public needs and requirements.
Key Roles for a Successful
Analytics Project
Figure 2-1 depicts the various roles and key stakeholders of
an analytics project. Each plays a critical part in a
successful analytics project. Although seven roles are
listed, fewer or more people can accomplish the work
depending on the scope of the project, the organizational
structure, and the skills of the participants. For example, on
a small, versatile team, these seven roles may be fulfilled by
only 3 people, but a very large project may require 20 or
more people. The seven roles follow.
Key Roles for a Successful Analytics
Project

Business User: Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context of the project, the
value of the results, and how the outputs will be operationalized. Usually a business analyst, line
manager, or deep subject matter expert in the project domain fulfills this role.

Project Sponsor: Responsible for the genesis of the project. Provides the impetus and
requirements for the project and defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs of the working team. This person
sets the priorities for the project and clarifies the desired outputs.

Project Manager: Ensures that key milestones and objectives are met on time and at the
expected quality.
Key Roles for a Successful Analytics
Project
Business Intelligence Analyst : Provides business domain expertise based on a deep understanding of the data, key performance
indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create
dashboards and reports and have knowledge of the data feeds and sources.

Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the working team.
These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place
related to the data repositories.

Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides
support for data ingestion into the analytic sandbox. Whereas the DBA sets up and configures the databases to be used, the data
engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics. The data engineer
works closely with the data scientist to help shape data in the right ways for analyses.

Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to
given business problems. Ensures overall analytics objectives are met. Designs and executes analytical methods and approaches with
the data available to the project. Although most of these roles are not new, the last two roles (data engineer and data scientist) have
become popular and in high demand as interest in Big Data has grown.
Data Analytics lifecycle
Data Analytics lifecycle
Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data (Section 2.3.4).
Data Analytics lifecycle

Phase 3 - Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

Phase 4 - Model building: In Phase 4, the team develops data sets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and work flows (for example, fast hardware and parallel processing, if applicable).

Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
THE END
