Unit 1


Exploring Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data.
• It is an analytical approach to extracting accurate and deep understanding from raw data using methods from statistics, machine learning, and related fields.
• The processes involved in data science include inspecting, cleaning, transforming, modeling, analyzing, and interpreting raw data.

Exploring data science is an exciting journey into the world of extracting valuable
insights and knowledge from data. It involves a combination of various techniques,
tools, and methodologies to uncover patterns, trends, and relationships within datasets.

Understand the Basics: Start by familiarizing yourself with the fundamental concepts of data science, including data types, variables, descriptive statistics, and basic data manipulation. This will give you a solid foundation to build upon.
Learn Programming Languages: Programming is crucial in data science.
Languages like Python and R are widely used for their extensive libraries and
tools designed for data analysis, manipulation, and visualization.
Statistics and Mathematics: A strong understanding of statistics and
mathematics is essential to make informed decisions about data. Concepts like
probability, hypothesis testing, and regression analysis are commonly used in
data science projects.
Data Collection and Cleaning: The quality of your analysis heavily depends on the
quality of your data. Learn how to collect and clean data, dealing with missing
values, outliers, and inconsistencies.
Exploratory Data Analysis (EDA): EDA involves visually exploring your data
through graphs, charts, and summary statistics to gain insights. This step helps
you understand the characteristics of your data and identify potential patterns.

Feature Engineering: Feature engineering involves creating new features or transforming existing ones to improve the performance of machine learning models. It's a creative and important step in the data preprocessing pipeline.

Machine Learning Algorithms: Explore different machine learning algorithms such as classification, regression, clustering, and more. Understand when to apply each algorithm and how to evaluate their performance.
Model Evaluation and Selection: Learn how to assess the performance of your models using metrics like accuracy, precision, recall, F1-score, and others (a short scikit-learn sketch follows this list). This helps you choose the best model for your specific problem.
Data Visualization: Visualizing data is essential for conveying insights to both
technical and non-technical audiences. Tools like Matplotlib, Seaborn, and Plotly
can help you create meaningful visualizations.
Communication Skills: Being able to explain your findings to non-technical
stakeholders is crucial. Develop your communication skills to present complex
technical concepts in an understandable manner.
Domain Knowledge: Depending on the field you're working in, having domain-
specific knowledge can greatly enhance your ability to extract meaningful
insights from data.
Real-World Projects: Hands-on experience is invaluable. Work on real-world data
science projects to apply what you've learned and build a portfolio showcasing
your skills to potential employers or collaborators.
Stay Updated: Data science is a rapidly evolving field. Stay up-to-date with the
latest trends, tools, and techniques through online courses, blogs, research
papers, and conferences.
Ethical Considerations: Data science often involves sensitive information.
Understand the ethical implications of your work, including privacy concerns and
bias mitigation.
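
To make the evaluation step concrete, here is a minimal scikit-learn sketch; the true labels and predictions below are invented purely to show the metric calls, and a real project would compute them from a trained model:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Hypothetical ground-truth labels and model predictions, for illustration only.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("f1-score :", f1_score(y_true, y_pred))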

Why Data Science?

Here are some significant advantages of using data science:

● Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinct business advantage.
● Data science can help you detect fraud using advanced machine learning algorithms.
● It helps you prevent significant monetary losses.
● It allows you to build intelligent capabilities into machines.
● You can perform sentiment analysis to gauge customer brand loyalty.
● It enables you to make better and faster decisions.
● It helps you recommend the right product to the right customer to enhance your business.
Key Component Technologies of Data Science:

Programming Languages:
● Python: Widely used for its simplicity, extensive libraries (such as NumPy,
pandas, scikit-learn, and Matplotlib), and community support in data
analysis, machine learning, and visualization.
● R: Popular for statistical analysis, data manipulation, and visualization,
especially in academia and research.
Data Storage and Databases:
● Relational Databases: Such as MySQL, PostgreSQL, and SQLite, used for
structured data storage and retrieval.
● NoSQL Databases: Like MongoDB, Cassandra, and Redis, suitable for
handling unstructured or semi-structured data.
● Data Warehouses: Solutions like Amazon Redshift, Google BigQuery, and
Snowflake for scalable storage and querying of large datasets.
Big Data Technologies:
● Hadoop: A framework for distributed storage and processing of large
datasets across clusters of computers.
● Spark: An open-source data processing engine that can handle batch
processing, real-time streaming, machine learning, and graph processing.
Data Cleaning and Preprocessing:
● OpenRefine: A tool for cleaning and transforming messy data, handling
inconsistencies, and standardizing formats.
● Trifacta: A platform for data wrangling, enabling efficient data cleaning
and preparation.
Machine Learning Frameworks:
● scikit-learn: A versatile machine learning library for classical algorithms
such as regression, classification, clustering, and dimensionality
reduction.
● TensorFlow: An open-source library developed by Google for building and
training neural network models.
● PyTorch: A popular deep learning framework known for its dynamic
computational graph and research-friendly design.
Visualization Tools:
● Matplotlib: A 2D plotting library for creating static, interactive, and
animated visualizations in Python.
● Seaborn: A higher-level visualization library built on top of Matplotlib,
providing more aesthetically pleasing and informative plots.
● Tableau: A powerful data visualization tool that allows users to create
interactive and shareable dashboards and reports.

Statistical Analysis Tools:
● RStudio: An integrated development environment (IDE) for R that
facilitates data analysis, visualization, and statistical modeling.
● Jupyter Notebooks: Interactive environments for creating and sharing
documents that combine live code, equations, visualizations, and
explanatory text.
Cloud Computing Platforms:
● Amazon Web Services (AWS): Offers various services for data storage,
processing, and analysis, including S3 for storage and EC2 for compute
resources.
● Google Cloud Platform (GCP): Provides tools like BigQuery, Dataflow, and
AI Platform for data-related tasks.
● Microsoft Azure: Offers services such as Azure Data Lake, Azure Machine
Learning, and Azure Databricks.
Natural Language Processing (NLP) Libraries:
● NLTK (Natural Language Toolkit): A comprehensive library for NLP tasks
such as tokenization, stemming, tagging, parsing, and more.
● spaCy: A modern NLP library with pre-trained models and efficient
processing capabilities.
Version Control Systems:
● Git: Used for tracking changes in code, collaborating with others, and
maintaining a history of project development.
APIs and Web Scraping:
● Requests: A Python library for making HTTP requests to APIs and web
pages.
● Beautiful Soup: A library for web scraping and parsing HTML and XML
documents.
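
As a small illustration of the last group above, here is a hedged Requests + Beautiful Soup sketch; the URL is only a placeholder, and any real scraping should respect a site's terms of use:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com")        # fetch a page over HTTP
    soup = BeautifulSoup(response.text, "html.parser")    # parse the returned HTML

    # Pull out the page title and the first few link targets as a demonstration.
    print(soup.title.string if soup.title else "no <title> found")
    print([a.get("href") for a in soup.find_all("a")][:10])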

These component technologies provide the building blocks for the various stages of a data science project, from data collection and cleaning to analysis, modeling, visualization, and deployment. Depending on the specific project requirements and goals, data scientists may use a combination of these technologies to effectively work with data and generate meaningful insights.


Data Science Life Cycle

1. Business Understanding
The data scientists in the room are the people who keep asking the why’s. They’re the
people who want to ensure that every decision made in the company is supported by
concrete data, and that it is guaranteed (with a high probability) to achieve results.
Before you can even start on a data science project, it is critical that you understand the
problem you are trying to solve.

1. How much or how many? (regression)
2. Which category? (classification)
3. Which group? (clustering)
4. Is this weird? (anomaly detection)
5. Which option should be taken? (recommendation)

2. Data Mining
Data mining is the process of gathering your data from different sources. It is a process
that involves discovering patterns, relationships, and insights from large datasets using
various techniques and methods. It's a subset of the broader field of data science and is
focused on extracting valuable information from data to support decision-making,
predictions, and knowledge discovery. Data mining is often used to uncover hidden
patterns that might not be immediately apparent through simple data analysis.
3. Data Cleaning
Data cleaning, also referred to as data cleansing or data preprocessing, is a crucial step
in the data science lifecycle. It involves identifying and correcting errors,
inconsistencies, and inaccuracies in the raw data to ensure that the data is suitable for
analysis and modeling. Proper data cleaning helps improve the quality of the data,
leading to more accurate and reliable insights.

Handling Missing Values
Dealing with Outliers
Correcting Inconsistencies
Handling Duplicates
Normalization and Scaling
Encoding Categorical Variables
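
A minimal pandas sketch of the cleaning steps listed above; the file name and the "age", "income", and "city" columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("data.csv")

    df = df.drop_duplicates()                                   # handle duplicates
    df["age"] = df["age"].fillna(df["age"].median())            # fill missing values
    low, high = df["income"].quantile([0.01, 0.99])
    df = df[df["income"].between(low, high)]                    # trim extreme outliers
    df["city"] = df["city"].str.strip().str.title()             # fix simple inconsistencies
    df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()  # scale
    df = pd.get_dummies(df, columns=["city"])                   # encode a categorical variable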

4. Data Exploration
Data exploration, also known as exploratory data analysis (EDA), is a crucial preliminary
step in the data science process. It involves visually and statistically summarizing,
analyzing, and understanding the characteristics of a dataset. The primary goal of data
exploration is to gain insights, uncover patterns, identify anomalies, and guide the
subsequent stages of data analysis and modeling.
Descriptive Statistics
Data Visualization
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
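
A short EDA sketch in pandas and Matplotlib, again assuming a hypothetical "data.csv" with an "age" column:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")

    print(df.describe())                        # descriptive statistics
    print(df["age"].describe())                 # univariate analysis of one variable
    print(df.select_dtypes("number").corr())    # bivariate/multivariate correlations

    df["age"].hist(bins=20)                     # visual check of a distribution
    plt.xlabel("age")
    plt.ylabel("count")
    plt.show()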

5. Feature Engineering
Feature engineering is a crucial step in the data preprocessing phase of a data science
project. It involves creating new features (variables) from the existing ones or
transforming existing features to improve the performance of machine learning models.
Effective feature engineering can significantly enhance the predictive power and
generalization ability of models.
Feature Creation
Domain Knowledge
Feature Selection
Feature Extraction
Binning/Bucketing
Scaling and Normalization
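
A hedged sketch of a few of these techniques with pandas and scikit-learn; the dataset and the "age", "income", and "city" columns are assumptions made for illustration:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("data.csv")

    # Feature creation: derive a new variable from existing ones.
    df["income_per_year_of_age"] = df["income"] / df["age"]

    # Binning/bucketing: turn a continuous variable into categories.
    df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                             labels=["minor", "young", "middle", "senior"])

    # Scaling and normalization of numeric features.
    df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])

    # Encoding a categorical variable.
    df = pd.get_dummies(df, columns=["city"])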

6. Predictive Modeling
Predictive modeling is a core aspect of data science and machine learning. It involves
using historical data to build models that can make predictions about future events or
outcomes. These models learn patterns and relationships within the data and then apply
that learning to make predictions on new, unseen data. Predictive modeling is used in
various fields and applications, including finance, healthcare, marketing,
recommendation systems, and more.
Data Collection and Preparation
Data Splitting
Model Selection
Model Training
Model Evaluation
Hyperparameter Tuning
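
The steps above can be strung together in scikit-learn; this is a minimal sketch using the library's built-in iris dataset so it stays self-contained, not a recipe for any particular project:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    X, y = load_iris(return_X_y=True)

    # Data splitting: hold out a test set for the final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model selection, training, and hyperparameter tuning via cross-validated grid search.
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
                        cv=5)
    grid.fit(X_train, y_train)

    # Model evaluation on the held-out data.
    print(grid.best_params_)
    print(classification_report(y_test, grid.predict(X_test)))
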
7. Data Visualization
Data visualization is the graphical representation of data to communicate insights,
trends, patterns, and relationships in a visual and easily understandable format. It plays
a crucial role in data analysis and communication by turning raw data into meaningful
visuals that facilitate understanding and decision-making. Effective data visualization
can simplify complex data, highlight key points, and reveal insights that might not be
apparent from raw data alone.
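
A minimal Matplotlib sketch; the monthly figures are invented purely for illustration:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 150, 145, 170, 190]

    plt.plot(months, sales, marker="o")            # show the trend over time
    plt.title("Monthly sales (illustrative data)")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.tight_layout()
    plt.show()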

Big Data:
Big Data refers to the massive volumes of data that exceed the capacity of traditional
data processing tools to efficiently capture, store, manage, and analyze. It involves the
three "Vs": volume, velocity, and variety. Big Data encompasses a wide range of data
types, including structured, semi-structured, and unstructured data, and it is generated
at a high speed from various sources like social media, sensors, logs, and more.

Data Science plays a crucial role in making sense of Big Data. Here's how they are
connected:
Data Processing and Analysis: Data Science methods are used to analyze Big
Data and derive meaningful insights from it. As Big Data often involves
unstructured or semi-structured data, advanced techniques are needed to
process and make sense of the information within it.
Predictive Analytics: Data Science techniques, such as machine learning, are
applied to Big Data to build predictive models that can forecast future trends,
behaviors, and outcomes based on historical data patterns.
Real-Time Insights: Big Data generated at high velocity requires real-time or
near-real-time analysis to make timely decisions. Data Science methods,
particularly stream processing and real-time analytics, help extract insights from
streaming data.
Feature Engineering: In the context of Big Data, feature engineering is critical for
improving the performance of machine learning models. Data scientists work on
creating relevant and informative features to enhance model accuracy.
Dimensionality Reduction: As Big Data often involves a large number of features, dimensionality reduction techniques are used in data science to extract the most relevant information and reduce computational complexity (a brief PCA sketch follows this list).
Scalable Algorithms: Data Science techniques are adapted or developed to work
efficiently with large datasets. Scalable machine learning algorithms are used to
process and analyze Big Data within a reasonable time frame.
Data Visualization: Data Science involves visualizing data to communicate
insights effectively. In the context of Big Data, creating visualizations that provide
clear representations of complex information becomes even more important.
Resource Management: When dealing with Big Data, resource management and
optimization are crucial. Data Science methods help allocate computational
resources effectively to process and analyze the data efficiently.
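
As a brief sketch of the dimensionality-reduction point, here is PCA applied to a synthetic, wide matrix; the data is random and the component count is arbitrary:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(1000, 200)               # 1,000 rows with 200 features (synthetic)
    pca = PCA(n_components=10)                  # keep only 10 derived components
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                          # (1000, 10)
    print(pca.explained_variance_ratio_.sum())      # share of variance retained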

Business Intelligence (BI) refers to a set of technologies, processes, and tools that help organizations collect, analyze, and present business-related data to support informed decision-making. BI enables companies to turn raw data into actionable insights, empowering them to make strategic, operational, and tactical decisions that drive business growth and efficiency.

Here are some key aspects of business intelligence:

Data Collection and Integration: BI involves gathering data from various sources within
an organization, such as databases, spreadsheets, CRM systems, ERP systems, and
external data sources. This data is then integrated and transformed into a unified format
for analysis.
Data Warehousing: A data warehouse is a centralized repository that stores historical
and current data from different sources. It enables organizations to have a single source
of truth for reporting and analysis.
Data Analysis and Reporting: BI tools allow users to analyze data through ad hoc
queries, interactive dashboards, and predefined reports. Users can explore data, identify
trends, and gain insights into the business's performance.
Data Visualization: Visual representations such as charts, graphs, maps, and infographics make complex data more understandable and help convey insights to non-technical stakeholders.
Dashboard Creation: Dashboards provide a visual overview of key performance indicators (KPIs) and metrics relevant to the business. They offer a real-time or near-real-time snapshot of business performance.
OLAP (Online Analytical Processing): OLAP tools allow users to explore multidimensional data by slicing, dicing, and drilling down into data cubes, helping users analyze data from different perspectives (a small pandas pivot-table sketch follows this list).
Data Mining and Predictive Analytics: BI tools can use historical data to identify
patterns and make predictions about future trends. This is particularly valuable for
forecasting demand, customer behavior, and market trends.
Self-Service BI: Self-service BI empowers non-technical users to create their own
reports and perform data analysis without relying on IT or data analysts. This reduces
bottlenecks and accelerates decision-making.
Mobile BI: With the rise of mobile devices, BI tools have adapted to provide insights on
smartphones and tablets, enabling decision-makers to access critical information on the
go.
Data Governance and Security: BI systems need to ensure data accuracy, consistency,
and security. Access controls and user permissions are crucial to protect sensitive
information.
Integration with Machine Learning and AI: Some advanced BI tools integrate with
machine learning and AI algorithms to enhance predictive analytics and automate
decision-making processes.
Cloud-Based BI: Cloud-based BI platforms allow organizations to access and analyze
data from anywhere, providing scalability and cost-effectiveness.
Collaboration and Sharing: BI tools facilitate collaboration by enabling users to share
reports, dashboards, and insights with colleagues, fostering a data-driven culture.
Continuous Monitoring and Improvement: BI is an iterative process. Organizations
continually monitor KPIs, gather feedback, and refine their strategies based on the
insights gained.
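
To make the OLAP-style "slicing and dicing" idea tangible outside a BI tool, here is a small pandas pivot-table sketch; the sales records are invented for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "South"],
        "product": ["A", "B", "A", "B", "A"],
        "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
        "revenue": [100, 80, 120, 90, 150],
    })

    # "Dice": summarize revenue by region and product, with quarters as columns.
    cube = sales.pivot_table(values="revenue", index=["region", "product"],
                             columns="quarter", aggfunc="sum", fill_value=0)
    print(cube)

    # "Slice": look at a single region.
    print(cube.loc["South"])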

In summary, business intelligence is a multidisciplinary approach that combines technology, data analysis, and business expertise to help organizations make smarter decisions. By leveraging BI tools and practices, businesses can optimize their operations, improve customer satisfaction, identify growth opportunities, and stay competitive in an ever-evolving market.

Microsoft Excel is a widely used spreadsheet application that can be a valuable tool for certain aspects of data science, especially for beginners or for simple data analysis tasks. While it might not offer the same level of sophistication as specialized data science tools and programming languages, Excel can be useful for data exploration, visualization, and basic analysis. Here are some ways in which MS Excel can be used in data science:

Data Cleaning and Preparation: Excel provides features for data cleaning, such as removing duplicates, filling missing values, and basic data transformations. It's suitable for smaller datasets where manual cleaning is manageable.
Exploratory Data Analysis (EDA): You can use Excel to generate basic summary statistics, histograms, and charts to gain initial insights into your data. PivotTables and PivotCharts can be useful for exploring relationships within the data.
Data Visualization: Excel offers a variety of chart types, making it possible to create simple visualizations for sharing insights with non-technical stakeholders.
Basic Statistical Analysis: Excel provides functions for calculating basic statistics like mean, median, standard deviation, and correlations. You can perform simple statistical tests and calculations.
What-If Analysis: Excel's scenario manager and goal seek features can help you perform "what-if" analyses, allowing you to understand how changes in variables impact outcomes.
Regression Analysis: You can use Excel's built-in regression analysis tool to perform linear regression and analyze relationships between variables.
Time Series Analysis: Excel can be used for basic time series analysis, including creating time series plots and calculating moving averages.
Basic Data Mining: Excel offers features like data filtering, sorting, and conditional formatting, which can help you explore patterns and trends in your data.

Despite these advantages, there are limitations to using Excel in data science:

● Scalability: Excel is not designed to handle very large datasets efficiently. As data sizes grow, Excel's performance might degrade significantly.
● Complex Analysis: For advanced statistical analysis, machine learning, and more complex tasks, you might find Excel limited compared to dedicated data science tools and programming languages.
● Reproducibility and Automation: Excel lacks the ability to easily script and automate processes, making it challenging to reproduce analyses or create reusable workflows.
● Customization: While Excel provides standard functions and tools, it might not be flexible enough to accommodate custom analyses or specialized algorithms.

For more advanced data science tasks, consider using dedicated data science tools and programming languages like Python (with libraries like pandas, NumPy, scikit-learn, etc.) or R. These languages offer more robust capabilities for data manipulation, analysis, and modeling, and they are widely used in the data science community. However, for beginners or for quick exploratory tasks, Excel can serve as a useful starting point in your data science journey.


Python is one of the most popular programming languages in the field of data science due to its versatility, rich ecosystem of libraries, and ease of use. It's widely used for various data-related tasks, including data analysis, machine learning, data visualization, and more. Here's how Python is used in data science:

Data Manipulation and Analysis: Python's pandas library provides powerful tools for data manipulation, transformation, and analysis. It offers data structures like DataFrame that make it easy to work with structured data (a short pandas sketch follows this list).
Data Visualization: Libraries like Matplotlib, Seaborn, and Plotly allow you to create a wide range of visualizations to help you explore and communicate insights from your data.
Statistical Analysis: Python provides libraries for statistical analysis, including scipy and statsmodels, which enable you to perform hypothesis testing, ANOVA, regression analysis, and more.
Machine Learning: Python has robust machine learning libraries like scikit-learn, TensorFlow, and PyTorch. These libraries provide implementations of various algorithms for classification, regression, clustering, and deep learning.
Natural Language Processing (NLP): Libraries like NLTK and spaCy enable you to work with text data and perform tasks like text classification, sentiment analysis, and entity recognition.
Web Scraping: Python libraries like Beautiful Soup and Requests allow you to extract data from websites and APIs, which can be useful for collecting data for analysis.
Time Series Analysis: Libraries like pandas and statsmodels offer tools for handling time series data, performing decomposition, forecasting, and more.
Geospatial Analysis: Libraries like geopandas and folium enable geospatial data manipulation, visualization, and analysis.
Data Preprocessing: Python provides functions for data cleaning, transformation, and feature engineering, which are essential steps before applying machine learning algorithms.
Interactive Notebooks: Jupyter Notebooks provide an interactive environment where you can combine code, visualizations, and explanations. They're popular for documenting and sharing data science workflows.
Community and Resources: Python's strong data science community means you'll find plenty of resources, tutorials, and libraries to help you learn and apply data science techniques effectively.
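
A short pandas sketch of the manipulation and analysis work described above; the department/salary table is made up for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "department": ["IT", "IT", "HR", "HR", "Sales"],
        "salary":     [70000, 82000, 55000, 60000, 65000],
    })

    high_paid = df[df["salary"] > 60000]                       # filter rows
    by_dept = df.groupby("department")["salary"].mean()        # aggregate by group
    df["salary_rank"] = df["salary"].rank(ascending=False)     # add a derived column

    print(high_paid)
    print(by_dept)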

R is a programming language and environment that is widely used in the field of data science and statistics. It was designed specifically for data analysis and statistical modeling, making it a powerful tool for various data-related tasks. Here's how R is used in data science:

Data Manipulation and Analysis: R offers a rich ecosystem of packages, with the core library known as "base R" providing functions for data manipulation, transformation, and analysis. Additionally, the tidyverse collection of packages, including dplyr and tidyr, provides a more user-friendly syntax for data manipulation.
Statistical Analysis: R is renowned for its statistical capabilities. It provides a wide range of built-in statistical functions and packages such as caret that allow you to perform various types of statistical analyses, hypothesis testing, linear and nonlinear modeling, and more.
Data Visualization: The ggplot2 package in R is highly regarded for creating customizable and publication-quality visualizations. It follows a grammar-of-graphics approach, making it easy to create complex plots and visualizations.
Machine Learning: R has a growing ecosystem of machine learning packages, including caret, randomForest, xgboost, and more. These packages provide implementations of algorithms for classification, regression, clustering, and more.
Time Series Analysis: R provides specialized packages like forecast and tseries for time series analysis, allowing you to perform tasks like decomposition, forecasting, and anomaly detection.
Text Mining and Natural Language Processing (NLP): R offers packages like tm and quanteda for text mining and NLP tasks, including text preprocessing, sentiment analysis, topic modeling, and more.
Interactive Data Exploration: R Markdown and Shiny allow you to create interactive documents, reports, and web applications that integrate code, visualizations, and explanatory text.
Community and Resources: R has a strong and active community of data scientists and statisticians, resulting in a wealth of resources, tutorials, and packages available for various data science tasks.
Data Preprocessing and Cleaning: R provides functions and packages for data cleaning, transformation, and feature engineering, similar to Python's pandas library.
Advanced Statistics: R excels in providing a wide range of advanced statistical techniques and specialized packages for fields like econometrics, bioinformatics, and the social sciences.


R's emphasis on statistics and data analysis makes it a preferred choice for researchers, statisticians, and professionals in fields that require in-depth statistical understanding. Its extensive library ecosystem, along with its capabilities in data manipulation, analysis, and visualization, makes it a valuable tool for anyone working on data-related tasks and analyses.

Hadoop is an open-source framework designed for distributed storage and processing of large volumes of data across clusters of commodity hardware. It provides a scalable and fault-tolerant solution for handling Big Data, which includes massive amounts of structured, semi-structured, and unstructured data. Hadoop's architecture is built around two core components: the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.

Here's an overview of the key components and concepts of Hadoop:

Hadoop Distributed File System (HDFS): HDFS is a distributed file system designed to store very large files across multiple machines while providing fault tolerance. Data is divided into blocks and replicated across nodes in the cluster to ensure reliability. HDFS is optimized for handling large files, making it suitable for Big Data storage.
MapReduce: MapReduce is a programming model and processing engine for distributed data processing. It simplifies the processing of large datasets by breaking down tasks into smaller subtasks that can be executed in parallel across the cluster. The Map phase involves processing data and emitting key-value pairs, while the Reduce phase aggregates and summarizes the results (a small word-count illustration appears at the end of this section).
YARN (Yet Another Resource Negotiator): YARN is the resource management layer of Hadoop that enables efficient sharing of cluster resources among different applications. It manages resources and schedules tasks, allowing multiple applications to run simultaneously on the same Hadoop cluster.
Hadoop Ecosystem: Hadoop's ecosystem includes various additional tools and projects that extend its capabilities. Some notable components include:
● Hive: A data warehouse infrastructure that provides a SQL-like language for querying and managing large datasets stored in HDFS.
● Pig: A high-level platform for creating MapReduce programs using a scripting language.
● HBase: A distributed, scalable NoSQL database that can store and manage large amounts of sparse data.
● Spark: A data processing engine that offers faster processing than traditional MapReduce by utilizing in-memory computations.
● Impala: A query engine that provides real-time interactive SQL queries on Hadoop data.
● Sqoop: A tool for transferring data between Hadoop and relational databases.
● Flume and Kafka: Tools for collecting, aggregating, and moving data into Hadoop from various sources.

Hadoop's distributed storage and processing capabilities provide a powerful framework for managing Big Data, while data science techniques extract valuable insights from the data.
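
To illustrate the MapReduce idea mentioned above, here is a pure-Python word-count sketch; it only mimics the map, shuffle, and reduce phases in a single process, whereas Hadoop would distribute them across a cluster:

    from itertools import groupby

    lines = ["big data needs big tools", "data science loves data"]

    # Map phase: emit (key, value) pairs.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle/sort: group the pairs by key (Hadoop does this between the phases).
    mapped.sort(key=lambda kv: kv[0])

    # Reduce phase: aggregate the values for each key.
    for word, group in groupby(mapped, key=lambda kv: kv[0]):
        print(word, sum(count for _, count in group))
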
SQL databases and data science are closely connected, as SQL databases serve as a fundamental source of structured data for many data science projects. SQL (Structured Query Language) databases provide a structured and efficient way to store, manage, and retrieve data, which can then be used for various data analysis and machine learning tasks.

SQL databases provide a foundational layer for data science by offering structured data storage, retrieval, and manipulation capabilities. Data scientists often use SQL queries to prepare and extract valuable insights from data before applying more advanced analysis techniques and machine learning algorithms.
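
A self-contained sketch of pulling structured data out of a SQL database for analysis, using Python's built-in sqlite3 module with an in-memory database; the table and values are hypothetical:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 120.0), ("bob", 75.5), ("alice", 60.0)])
    conn.commit()

    # A typical preparatory step: aggregate in SQL, then analyze the result in pandas.
    df = pd.read_sql_query(
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer", conn)
    print(df)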

Loading data into R is a fundamental step in the data analysis process. R provides several functions and methods to read data from various file formats and sources. Here are some common ways to load data into R:

Reading CSV Files: CSV (Comma-Separated Values) files are a widely used format for storing tabular data. You can use the read.csv() function to read data from a CSV file:

data <- read.csv("data.csv")

Overview of the data science process

The typical data science process consists of six steps through which you’ll iterate, as outlined below.
1. The first step of this process is setting a research goal. The main purpose here
is making sure all the stakeholders understand the what, how, and why of the
project. In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for
analysis, so this step includes finding suitable data and getting access to the
data from the data owner. The result is data in its raw form, which probably
needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in your
models. To achieve this, you’ll detect and correct different kinds of errors in
the data, combine data from different data sources, and transform it. If you
have successfully completed this step, you can progress to data visualization
and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and
deviations based on visual and descriptive techniques. The insights you gain
from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling”). It is now that you attempt to gain the insights or make
the predictions stated in your project charter. Now is the time to bring out the
heavy guns, but remember research has taught us that often (but not always)
a combination of simple models tends to outperform one complicated model.
If you’ve done this phase right, you’re almost done.
6. The last step of the data science model is presenting your results and
automating the analysis, if needed. One goal of a project is to change a process
and/or make better decisions. You may still need to convince the business that
your findings will indeed change the business process as expected. This is
where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects
require you to perform the business process over and over again, so
automating the project will save time.

Difference Between Data Science and Business Intelligence (BI)

Parameters   | Business Intelligence                                    | Data Science
Perception   | Looking Backward                                         | Looking Forward
Data Sources | Structured Data (mostly SQL, sometimes a Data Warehouse) | Structured and Unstructured data, like logs, SQL, NoSQL, or text
Approach     | Statistics & Visualization                               | Statistics, Machine Learning, and Graph
Emphasis     | Past & Present                                           | Analysis & Neuro-linguistic Programming
Tools        | Pentaho, Microsoft BI, QlikView                          | R, TensorFlow

Applications of Data Science

Some applications of data science are:

Internet Search:
Google Search uses data science technology to return results for a specific query within a fraction of a second.
Recommendation Systems:
Recommendation systems, such as "suggested friends" on Facebook or "suggested videos" on YouTube, are built with the help of data science.
Image & Speech Recognition:
Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Likewise, Facebook recognizes your friends when you upload a photo with them, with the help of data science.
Gaming World:
EA Sports, Sony, and Nintendo use data science technology to enhance your gaming experience. Games are now developed using machine learning techniques, and they can update themselves as you move to higher levels.
Online Price Comparison:
PriceRunner, Junglee, and Shopzilla work on data science mechanisms; here, data is fetched from the relevant websites using APIs.
Challenges of Data Science Technology

● A high variety of information and data is required for accurate analysis
● The available pool of data science talent is inadequate
● Management does not provide financial support for a data science team
● Unavailability of, or difficult access to, data
● Business decision-makers do not use data science results effectively
● Explaining data science to others is difficult
● Privacy issues
● Lack of significant domain expertise
● If an organization is very small, it cannot support a data science team
