Unit 1
Exploring data science is an exciting journey into the world of extracting valuable
insights and knowledge from data. It involves a combination of various techniques,
tools, and methodologies to uncover patterns, trends, and relationships within datasets.
Programming Languages:
● Python: Widely used for its simplicity, extensive libraries (such as NumPy,
pandas, scikit-learn, and Matplotlib), and community support in data
analysis, machine learning, and visualization.
● R: Popular for statistical analysis, data manipulation, and visualization,
especially in academia and research.
Data Storage and Databases:
● Relational Databases: Such as MySQL, PostgreSQL, and SQLite, used for
structured data storage and retrieval.
● NoSQL Databases: Like MongoDB, Cassandra, and Redis, suitable for
handling unstructured or semi-structured data.
● Data Warehouses: Solutions like Amazon Redshift, Google BigQuery, and
Snowflake for scalable storage and querying of large datasets.
Big Data Technologies:
● Hadoop: A framework for distributed storage and processing of large
datasets across clusters of computers.
● Spark: An open-source data processing engine that can handle batch
processing, real-time streaming, machine learning, and graph processing.
Data Cleaning and Preprocessing:
● OpenRefine: A tool for cleaning and transforming messy data, handling
inconsistencies, and standardizing formats.
● Trifacta: A platform for data wrangling, enabling efficient data cleaning
and preparation.
Machine Learning Frameworks:
● scikit-learn: A versatile machine learning library for classical algorithms
such as regression, classification, clustering, and dimensionality
reduction.
● TensorFlow: An open-source library developed by Google for building and
training neural network models.
● PyTorch: A popular deep learning framework known for its dynamic
computational graph and research-friendly design.
Visualization Tools:
● Matplotlib: A 2D plotting library for creating static, interactive, and
animated visualizations in Python.
● Seaborn: A higher-level visualization library built on top of Matplotlib,
providing more aesthetically pleasing and informative plots.
● Tableau: A powerful data visualization tool that allows users to create
interactive and shareable dashboards and reports.
These component technologies provide the building blocks for various stages of a data
science project, from data collection and cleaning to analysis, modeling, visualization,
and deployment. Depending on the specific project requirements and goals, data
scientists may use a combination of these technologies to work with data effectively.
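To make this concrete, here is a minimal sketch of how several of the Python components listed above can fit together in one small analysis. The file name sales.csv and the columns revenue and ad_spend are hypothetical, used purely for illustration.

```python
# A minimal sketch of the core Python data science stack working together;
# the file "sales.csv" and its columns are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")                    # pandas: load tabular data
df["log_revenue"] = np.log1p(df["revenue"])      # NumPy: numerical transformation

model = LinearRegression()                        # scikit-learn: fit a simple model
model.fit(df[["ad_spend"]], df["log_revenue"])

df.plot.scatter(x="ad_spend", y="log_revenue")    # Matplotlib (via pandas): visualize
plt.show()
```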
1. Business Understanding
The data scientists in the room are the people who keep asking why. They want to ensure that every decision made in the company is supported by concrete data and is highly likely to achieve results.
Before you can even start on a data science project, it is critical that you understand the
problem you are trying to solve.
2. Data Mining
In a data science project, this step covers gathering your data from different sources. More broadly, data mining involves discovering patterns, relationships, and insights from large datasets using various techniques and methods. It's a subset of the broader field of data science and is
focused on extracting valuable information from data to support decision-making,
predictions, and knowledge discovery. Data mining is often used to uncover hidden
patterns that might not be immediately apparent through simple data analysis.
3. Data Cleaning
Data cleaning, also referred to as data cleansing or data preprocessing, is a crucial step
in the data science lifecycle. It involves identifying and correcting errors,
inconsistencies, and inaccuracies in the raw data to ensure that the data is suitable for
analysis and modeling. Proper data cleaning helps improve the quality of the data,
leading to more accurate and reliable insights.
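As an illustration, here is a minimal cleaning sketch using pandas. The file raw_data.csv and the columns age and city are hypothetical; the steps shown (deduplication, type coercion, imputation, text standardization, range checks) are common examples rather than a complete recipe.

```python
# A minimal data-cleaning sketch with pandas; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")

df = df.drop_duplicates()                              # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # coerce bad entries to NaN
df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
df["city"] = df["city"].str.strip().str.title()        # standardize text formatting
df = df[df["age"].between(0, 120)]                     # drop implausible values

df.info()                                              # verify types and missing counts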
4. Data Exploration
Data exploration, also known as exploratory data analysis (EDA), is a crucial preliminary
step in the data science process. It involves visually and statistically summarizing,
analyzing, and understanding the characteristics of a dataset. The primary goal of data
exploration is to gain insights, uncover patterns, identify anomalies, and guide the
subsequent stages of data analysis and modeling.
Key aspects of data exploration include:
● Descriptive Statistics
● Data Visualization
● Univariate Analysis
● Bivariate Analysis
● Multivariate Analysis
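The aspects listed above can be sketched quickly with pandas and seaborn. The file customers.csv and the columns income, spend, and segment are hypothetical.

```python
# A brief EDA sketch; dataset and column names are hypothetical.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")

print(df.describe())                   # descriptive statistics
print(df["segment"].value_counts())    # univariate view of a categorical column
print(df[["income", "spend"]].corr())  # bivariate relationship

sns.histplot(df["income"])             # univariate visualization
plt.figure()
sns.scatterplot(data=df, x="income", y="spend", hue="segment")  # multivariate view
plt.show()
```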
5. Feature Engineering
Feature engineering is a crucial step in the data preprocessing phase of a data science
project. It involves creating new features (variables) from the existing ones or
transforming existing features to improve the performance of machine learning models.
Effective feature engineering can significantly enhance the predictive power and
generalization ability of models.
Key aspects of feature engineering include:
● Feature Creation
● Domain Knowledge
● Feature Selection
● Feature Extraction
● Binning/Bucketing
● Scaling and Normalization
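A short sketch of a few of these techniques (feature creation, binning, and scaling) using pandas and scikit-learn follows. The file orders.csv and its columns are hypothetical.

```python
# A feature-engineering sketch; the file "orders.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("orders.csv")

# Feature creation: derive a more informative variable from existing ones
df["price_per_item"] = df["total_price"] / df["quantity"]

# Binning/bucketing: turn a continuous variable into ordered categories
df["age_group"] = pd.cut(df["customer_age"],
                         bins=[0, 25, 45, 65, 120],
                         labels=["young", "adult", "middle-aged", "senior"])

# Scaling and normalization: put numeric features on a comparable scale
scaler = StandardScaler()
cols = ["quantity", "price_per_item"]
df[cols] = scaler.fit_transform(df[cols])
```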
6. Predictive Modeling
Predictive modeling is a core aspect of data science and machine learning. It involves
using historical data to build models that can make predictions about future events or
outcomes. These models learn patterns and relationships within the data and then apply
that learning to make predictions on new, unseen data. Predictive modeling is used in
various fields and applications, including finance, healthcare, marketing,
recommendation systems, and more.
Key steps in the predictive modeling workflow include:
● Data Collection and Preparation
● Data Splitting
● Model Selection
● Model Training
● Model Evaluation
● Hyperparameter Tuning
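The workflow above can be sketched end to end with scikit-learn. The file churn.csv and the target column churned are hypothetical, and the random forest with a small grid search stands in for whichever model and tuning strategy a project actually needs.

```python
# A sketch of the predictive-modeling workflow; dataset and target are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn.csv")
X, y = df.drop(columns="churned"), df["churned"]

# Data splitting: hold out unseen data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection and hyperparameter tuning via cross-validated grid search
param_grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)                      # model training

# Model evaluation on the held-out test set
print(accuracy_score(y_test, search.predict(X_test)))
```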
7. Data Visualization
Data visualization is the graphical representation of data to communicate insights,
trends, patterns, and relationships in a visual and easily understandable format. It plays
a crucial role in data analysis and communication by turning raw data into meaningful
visuals that facilitate understanding and decision-making. Effective data visualization
can simplify complex data, highlight key points, and reveal insights that might not be
apparent from raw data alone.
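As a small illustration, the Matplotlib sketch below turns a short series of numbers into a trend line that is far easier to read than the raw values. The monthly revenue figures are made up purely for the example.

```python
# A small Matplotlib sketch; the revenue figures are invented for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 150, 145, 170, 190]

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")   # a trend reads more easily as a line than as raw numbers
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
ax.set_title("Monthly revenue trend")
plt.show()
```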
Big Data:
Big Data refers to the massive volumes of data that exceed the capacity of traditional
data processing tools to efficiently capture, store, manage, and analyze. It involves the
three "Vs": volume, velocity, and variety. Big Data encompasses a wide range of data
types, including structured, semi-structured, and unstructured data, and it is generated
at a high speed from various sources like social media, sensors, logs, and more.
Data Science plays a crucial role in making sense of Big Data. Here's how they are
connected:
Data Processing and Analysis: Data Science methods are used to analyze Big
Data and derive meaningful insights from it. As Big Data often involves
unstructured or semi-structured data, advanced techniques are needed to
process and make sense of the information within it.
Predictive Analytics: Data Science techniques, such as machine learning, are
applied to Big Data to build predictive models that can forecast future trends,
behaviors, and outcomes based on historical data patterns.
Real-Time Insights: Big Data generated at high velocity requires real-time or
near-real-time analysis to make timely decisions. Data Science methods,
particularly stream processing and real-time analytics, help extract insights from
streaming data.
Feature Engineering: In the context of Big Data, feature engineering is critical for
improving the performance of machine learning models. Data scientists work on
creating relevant and informative features to enhance model accuracy.
Dimensionality Reduction: As Big Data often involves a large number of features,
dimensionality reduction techniques are used in Data Science to extract the most
relevant information and reduce computational complexity (a brief sketch follows this list).
Scalable Algorithms: Data Science techniques are adapted or developed to work
efficiently with large datasets. Scalable machine learning algorithms are used to
process and analyze Big Data within a reasonable time frame.
Data Visualization: Data Science involves visualizing data to communicate
insights effectively. In the context of Big Data, creating visualizations that provide
clear representations of complex information becomes even more important.
Resource Management: When dealing with Big Data, resource management and
optimization are crucial. Data Science methods help allocate computational
resources effectively to process and analyze the data efficiently.
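The following is the dimensionality-reduction sketch referenced in the list above: a minimal example using PCA from scikit-learn. The data are randomly generated for illustration; a real Big Data pipeline would apply the same idea at much larger scale.

```python
# A minimal dimensionality-reduction sketch using PCA; data are randomly generated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))            # 1,000 samples with 200 features

pca = PCA(n_components=10)                  # keep the 10 highest-variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (1000, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```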
Business Intelligence (BI) refers to a set of technologies, processes, and tools that help
organizations collect, analyze, and present business-related data to support informed decision-making. BI enables companies to turn raw data into actionable insights, empowering them to
make strategic, operational, and tactical decisions that drive business growth and efficiency.
Data Collection and Integration: BI involves gathering data from various sources within
an organization, such as databases, spreadsheets, CRM systems, ERP systems, and
external data sources. This data is then integrated and transformed into a unified format
for analysis.
Data Warehousing: A data warehouse is a centralized repository that stores historical
and current data from different sources. It enables organizations to have a single source
of truth for reporting and analysis.
Data Analysis and Reporting: BI tools allow users to analyze data through ad hoc
queries, interactive dashboards, and predefined reports. Users can explore data, identify
trends, and gain insights into the business's performance.
Data Visualization: Visual representations such as charts, graphs, maps, and
infographics make complex data more understandable and help convey insights to non-technical stakeholders.
Dashboard Creation: Dashboards provide a visual overview of key performance
indicators (KPIs) and metrics relevant to the business. They offer a real-time or near-real-time snapshot of business performance.
OLAP (Online Analytical Processing): OLAP tools allow users to explore
multidimensional data by slicing, dicing, and drilling down into data cubes. This helps
users analyze data from different perspectives (a small code-level analogue appears at the end of this section).
Data Mining and Predictive Analytics: BI tools can use historical data to identify
patterns and make predictions about future trends. This is particularly valuable for
forecasting demand, customer behavior, and market trends.
Self-Service BI: Self-service BI empowers non-technical users to create their own
reports and perform data analysis without relying on IT or data analysts. This reduces
bottlenecks and accelerates decision-making.
Mobile BI: With the rise of mobile devices, BI tools have adapted to provide insights on
smartphones and tablets, enabling decision-makers to access critical information on the
go.
Data Governance and Security: BI systems need to ensure data accuracy, consistency,
and security. Access controls and user permissions are crucial to protect sensitive
information.
Integration with Machine Learning and AI: Some advanced BI tools integrate with
machine learning and AI algorithms to enhance predictive analytics and automate
decision-making processes.
Cloud-Based BI: Cloud-based BI platforms allow organizations to access and analyze
data from anywhere, providing scalability and cost-effectiveness.
Collaboration and Sharing: BI tools facilitate collaboration by enabling users to share
reports, dashboards, and insights with colleagues, fostering a data-driven culture.
Continuous Monitoring and Improvement: BI is an iterative process. Organizations
continually monitor KPIs, gather feedback, and refine their strategies based on the
insights gained.
In short, BI combines technology, data analysis, and business expertise to help organizations make smarter decisions. By
leveraging BI tools and practices, businesses can optimize their operations, improve customer experiences, and stay competitive.
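The following is the analogue referenced in the OLAP item above: a rough, code-level illustration of OLAP-style slicing and dicing using a pandas pivot table. The sales data are made up for the example; dedicated OLAP tools operate on far larger multidimensional cubes.

```python
# A rough analogue of OLAP-style aggregation with a pandas pivot table; data are invented.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "revenue": [100, 150, 120, 90],
})

# "Slice and dice" revenue by region and product, summed across quarters
cube = sales.pivot_table(values="revenue", index="region",
                         columns="product", aggfunc="sum")
print(cube)
```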
Microsoft Excel is a widely used spreadsheet application that can be a valuable tool for
certain aspects of data science, especially for beginners or for simple data analysis
tasks. While it might not offer the same level of sophistication as specialized data
science tools and programming languages, Excel can be useful for data exploration,
visualization, and basic analysis. Here are some ways in which MS Excel can be used in
data science:
Data Cleaning and Preparation: Excel provides features for data cleaning, such as removing duplicates, filtering, sorting, and using text functions to standardize values.
Exploratory Data Analysis (EDA): You can use Excel to generate basic summary
statistics, histograms, and charts to gain initial insights into your data.
PivotTables and PivotCharts can be useful for exploring relationships within the
data.
Basic Statistical Analysis: Excel's built-in functions can calculate summary statistics like mean, median, standard deviation, and correlations directly on worksheet data.
What-If Analysis: Excel's scenario manager and goal seek features can help you explore how changes in input values impact outcomes.
Regression Analysis: You can use Excel's built-in regression analysis tool to fit simple linear models and examine relationships between variables.
Basic Data Mining: Excel offers features like data filtering, sorting, and
conditional formatting, which can help you explore patterns and trends in your
data.
Despite these advantages, there are limitations to using Excel in data science:
● Complex Analysis: For advanced statistical analysis, machine learning, and more
complex tasks, you might find Excel limited compared to dedicated data science tools.
● Reproducibility and Automation: Excel lacks the ability to easily script and automate analyses, which makes it harder to build reproducible, reusable workflows.
● Customization: While Excel provides standard functions and tools, it might not offer the flexibility required for specialized or custom analyses.
For more advanced data science tasks, consider using dedicated data science tools and
programming languages like Python (with libraries like pandas, NumPy, scikit-learn, etc.)
or R. These languages offer more robust capabilities for data manipulation, analysis,
and modeling, and they are widely used in the data science community. However, for
beginners or for quick exploratory tasks, Excel can serve as a useful starting point in data science.
Python is a popular choice for data science due to its versatility, rich ecosystem of libraries, and ease of use. It's widely used for
various data-related tasks, including data analysis, machine learning, data visualization, and more. Here are some of the ways Python is used:
Data Manipulation and Analysis: Python's pandas library provides powerful tools
for data manipulation, transformation, and analysis. It offers data structures like DataFrames and Series for working with structured data.
Data Visualization: Libraries like Matplotlib, Seaborn, and Plotly allow you to create static and interactive visualizations for exploring and communicating data.
Statistical Analysis: Python offers libraries such as scipy and statsmodels, which enable you to perform hypothesis testing, ANOVA, regression, and other statistical analyses.
Machine Learning: Python has robust machine learning libraries like scikit-learn, TensorFlow, and PyTorch for building, training, and evaluating models.
Natural Language Processing (NLP): Libraries like NLTK and spaCy enable you
to work with text data and perform tasks like text classification, sentiment analysis, and named entity recognition.
Web Scraping: Python's libraries like Beautiful Soup and Requests allow you to
extract data from websites and APIs, which can be useful for collecting data for
analysis (a small sketch follows this list).
Time Series Analysis: Libraries like pandas and statsmodels offer tools for analyzing and forecasting time series data.
Data Preprocessing: Python supports data cleaning, transformation, and feature engineering, which are essential steps before applying machine learning algorithms.
Notebooks: Jupyter notebooks provide an interactive environment where you can combine code, visualizations, and explanations. They're popular for exploratory analysis and for sharing results.
Whether you're just starting out or already experienced, you'll find plenty of resources, tutorials, and libraries to help you learn and apply Python in data science.
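The following is the web-scraping sketch referenced above, using Requests and Beautiful Soup. The URL and the tags being extracted are placeholders, not a real data source.

```python
# A small web-scraping sketch; the URL and extracted tags are placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every level-2 heading on the page
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```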
R is a programming language and environment that is widely used in the field of data
science and statistics. It was designed specifically for data analysis and statistical
modeling, making it a powerful tool for various data-related tasks. Here's how R is used
in data science:
Data Manipulation and Analysis: R offers a rich ecosystem of packages, with the
core library known as "base R" providing functions for data manipulation, filtering, and aggregation. The tidyverse collection of packages,
including dplyr and tidyr, provides a more user-friendly syntax for data
manipulation.
Statistical Analysis: R includes a wide range of built-in statistical functions and packages like stats and
caret that allow you to perform various types of statistical analyses, hypothesis
testing, regression modeling, and more.
Time Series Analysis: R provides specialized packages like forecast and tseries
for time series analysis, allowing you to perform tasks like decomposition, forecasting, and trend analysis.
Text Mining and Natural Language Processing (NLP): R offers packages like tm
and quanteda for text mining and NLP tasks, including text preprocessing, tokenization, and sentiment analysis.
Data Preprocessing and Cleaning: R provides functions and packages for data cleaning, handling missing values, and preparing data for analysis.
R is particularly well suited to statistical analysis and data understanding. Its extensive library ecosystem, along with its capabilities in data
manipulation, analysis, and visualization, makes it a valuable tool for anyone working on data science projects.
Hadoop is an open-source framework that provides a scalable and fault-tolerant solution for handling Big Data, which includes massive amounts of
structured, semi-structured, and unstructured data. It is built around two core components: Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.
● HDFS: A distributed file system designed to store very large files across multiple machines while providing fault
tolerance. Data is divided into blocks and replicated across nodes in the cluster
to ensure reliability. HDFS is optimized for handling large files, making it suitable for Big Data workloads.
● MapReduce: A programming model for distributed processing that works by breaking down tasks into smaller subtasks that can be executed in parallel
across the cluster. The Map phase involves processing data and emitting key-value pairs, while the Reduce phase aggregates and summarizes the results.
The broader Hadoop ecosystem includes tools such as:
● Pig: A platform for analyzing large datasets using a high-level scripting language.
● Hive: A data warehouse tool that provides SQL-like querying of Hadoop data.
● Sqoop: A tool for transferring data between Hadoop and relational databases.
● Flume and Kafka: Tools for collecting, aggregating, and moving data into Hadoop.
Hadoop and data science complement each other: Hadoop provides the infrastructure for managing Big Data, while data science techniques extract valuable insights from the
data.
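To make the Map and Reduce phases concrete, here is a toy, single-process illustration of the idea in plain Python. This is not actual Hadoop code; the documents are invented, and a real cluster would distribute these phases across many machines.

```python
# A toy illustration of MapReduce: map emits key-value pairs, reduce aggregates them.
from collections import defaultdict

documents = ["big data needs big tools", "data tools process data"]

# Map phase: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(values) for word, values in grouped.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```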
SQL databases and data science are closely connected, as SQL databases serve as a
fundamental source of structured data for many data science projects. SQL (Structured
Query Language) databases provide a structured and efficient way to store, manage,
and retrieve data, which can then be used for various data analysis and machine
learning tasks.
SQL databases provide a foundational layer for data science by offering structured data
storage, retrieval, and manipulation capabilities. Data scientists often use SQL queries
to prepare and extract valuable insights from data before applying more advanced analytics and machine learning techniques.
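As a brief illustration of this workflow, the sketch below pulls structured data from a SQL database into Python for analysis. The database file sales.db, the orders table, and its columns are hypothetical; sqlite3 and pandas stand in for whatever database driver a project actually uses.

```python
# A sketch of querying a SQL database from Python; database and table are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")

# Let SQL filter and aggregate inside the database, then analyze the result in pandas
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    WHERE order_date >= '2023-01-01'
    GROUP BY region
"""
df = pd.read_sql_query(query, conn)
conn.close()

print(df.sort_values("total_sales", ascending=False))
```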
Loading data into R is a fundamental step in the data analysis process. R provides
several functions and methods to read data from various file formats and sources. Here are some common approaches:
Reading CSV Files: CSV (Comma-Separated Values) files are a widely used format for
storing tabular data. You can use the read.csv() function to read data from a CSV file into a data frame.