Contact For The Course: - Instructor: Dr. Kauser Ahmed P


Contact for the course

• Instructor: Dr. Kauser Ahmed P


– kauserahmed@vit.ac.in
– 9942992332
– Office: SJT 310 A21
– Office hours: Wed 11:00 A.M. to 1:00 P.M.
– All course materials will be posted on VTOP
Plan for this lecture
• Course information – syllabus, grading, etc.
• What is data science?
• Data science: why all the excitement?
• Basic Python programming
Course Structure
About the Course
● Course Code : MDI3002

Course Title : Foundations of Data Science

No. of Hours : 45

CAT 1 and CAT 2 : (15 + 15 Marks)

No. of Digital Assignments : 3 (10+10+10 marks)

Theory FAT (40 Marks)
Course Objectives
• To provide fundamental knowledge of data science and
to understand the role of statistics and optimization in
performing mathematical operations in the field of data
science.
• To understand the process of handling heterogeneous
data and visualize them for better understanding.
• To gain fundamental knowledge of various open-source
data science tools and understand how to apply them
to solve various industrial problems.
Course Outcomes
• Ability to obtain fundamental knowledge on data science.
• Demonstrate proficiency in statistical analysis of data.
• Develop mathematical knowledge and study various
optimization techniques to perform data science
operations.
• Handle various types of data and visualize them through
programming for knowledge representation.
• Demonstrate numerous open source data science tools to
solve real-world problems through industrial case studies.
Course Syllabus
Text and reference books
Module 1 Basics of Data Science
• Introduction; Typology of problems;
Importance of linear algebra, statistics and
optimization from a data science perspective;
Structured thinking for solving data science
problems, Structured and unstructured data
What is Data Science ?

• From the ACM Taskforce on Data Science Curricula - Draws from many different disciplines
What is Data Science ?
• Statistics
– a central component of data science. Why?
– Statistics studies how to make robust conclusions with incomplete information.

• Computing
– a central component because programming allows us to apply analysis
techniques to the large and diverse data sets that arise in real-world
applications:
– not just numbers, but text, images, videos, and sensor readings.

• Domain Knowledge
– Through understanding a particular domain, data scientists learn to ask appropriate
questions about their data and correctly interpret the answers provided by inferential
and computational tools.

Data science is all of these things, but it is more than the sum of its parts because of the
applications.
What is data science?
 Data science enables businesses to process huge amounts
of structured and unstructured big data to detect patterns.
 This in turn allows companies to increase efficiencies,
manage costs, identify new market opportunities, and
boost their market advantage.
 Asking a personal assistant like Alexa or Siri for a
recommendation demands data science.
 Operating a self-driving car, using a search engine that
provides useful results, and talking to a chatbot for
customer service all rely on data science as well.
What is Foundations of Data Science?
• Drawing useful conclusions from data using computation

• Exploration
– Identifying patterns in information
– Uses visualizations

• Inference
– Quantifying whether those patterns are reliable
– Uses randomization
– quantifying our degree of certainty: will those patterns we found also appear in new
observations? How accurate are our predictions?

• Prediction
– Making informed guesses
– Uses machine learning
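These three activities can be sketched in a few lines of Python. The temperature readings below are made-up illustration data; the bootstrap resampling step shows how randomization supports inference:

```python
import random
import statistics

# A small sample of daily temperatures (hypothetical data).
temps = [21.5, 23.0, 19.8, 22.4, 24.1, 20.7, 22.9, 21.2, 23.6, 20.1]

# Exploration: summarize the pattern we see in the sample.
print("sample mean:", round(statistics.mean(temps), 2))

# Inference: use randomization (bootstrap resampling) to ask how much
# the mean would vary if we had drawn a different sample.
random.seed(0)
boot_means = [
    statistics.mean(random.choices(temps, k=len(temps)))
    for _ in range(1000)
]
boot_means.sort()
low, high = boot_means[25], boot_means[975]  # middle 95% of resampled means
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

Prediction would add a third step, fitting a model to these observations to guess future values; that is where machine learning enters.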
Example data science applications
• Marketing: predict the characteristics of high lifetime value
(LTV) customers, which can be used to support customer
segmentation, identify upsell opportunities, and support
other marketing initiatives
• Logistics: forecast how many of which things you need and
where you will need them, which enables lean inventory
and prevents out-of-stock situations
• Healthcare: analyze survival statistics for different patient
attributes (age, blood type, gender, etc.) and treatments;
predict risk of re-admittance based on patient attributes,
medical history, etc.
More Examples
• Transaction Databases  Recommender systems (NetFlix), Fraud Detection
(Security and Privacy)

• Wireless Sensor Data  Smart Home, Real-time Monitoring, Internet of Things

• Text Data, Social Media Data  Product Review and Consumer Satisfaction
(Facebook, Twitter, LinkedIn), E-discovery

• Software Log Data  Automatic Trouble Shooting (Splunk)

• Genotype and Phenotype Data  Epic, 23andMe, Patient-Centered Care,
Personalized Medicine
Data Science in E-commerce
• Product Recommendations
• Market Basket Analysis
• Price Optimization
• Demand Forecasting
• Customer Support
• Fraud Protection
• Churn prediction
Product Recommendations

• It is a tool that uses machine learning algorithms to
suggest products to customers that they might be
interested in
• The recommendations can happen through the
following channels:
• Within websites
• Email campaigns
• Online ads
• Recommendation engines are of two types:
• Non-personalized recommendation engines
• Personalized recommendation engines
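A minimal sketch of both engine types, using a hypothetical set of purchase histories (popularity counting for the non-personalized case, co-purchase counting for the personalized case):

```python
from collections import Counter

# Hypothetical purchase histories: one list of product IDs per customer.
orders = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
    ["laptop", "mouse", "bag"],
]

# Non-personalized engine: recommend the overall best-sellers to everyone.
popularity = Counter(item for order in orders for item in order)
top_overall = [item for item, _ in popularity.most_common(2)]

# Personalized engine: recommend items most often co-purchased with
# something this customer already bought.
def recommend(bought, orders, k=2):
    co = Counter()
    for order in orders:
        if any(item in order for item in bought):
            co.update(i for i in order if i not in bought)
    return [item for item, _ in co.most_common(k)]

print(top_overall)
print(recommend(["phone"], orders))
```

In a real system the co-occurrence counts would be replaced by a learned model, but the split between "same list for everyone" and "list conditioned on the customer's history" is the essential difference between the two engine types.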
Market Basket Analysis

• MBA is a technique that identifies the strength of
association between pairs of products that are
purchased together
• Market basket analysis uses machine learning and
deep learning algorithms
• Market basket analysis is based on association rules,
a machine learning method
• It is a combination of baskets (also known as segments)
and measures
• It involves support, confidence, lift, chi-square
analysis, etc.
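The support, confidence, and lift measures mentioned above can be computed directly; the baskets below are hypothetical illustration data:

```python
# Each basket is the set of items in one transaction (hypothetical data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """How much more often the items co-occur than if they were independent."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread"}, {"milk"})
print("support:", support(rule[0] | rule[1]))
print("confidence:", confidence(*rule))
print("lift:", lift(*rule))
```

A lift above 1 suggests the two products are bought together more often than chance would predict; association-rule miners such as Apriori search for rules whose support and confidence exceed chosen thresholds.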
Price Optimization

• It is a balance between customer satisfaction
and the firm’s profit
• It mainly incorporates
• Segmentation of customers and products
• Regression modeling
• Dynamic pricing
• Most customers say that the most essential
aspect of shopping at an online e-commerce
store is competitive pricing.
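A toy sketch of the idea: assume regression has fitted a linear demand model (the intercept and slope below are hypothetical stand-ins for estimated coefficients), then search for the price that maximizes revenue:

```python
# Hypothetical linear demand model fitted by regression: demand falls as
# price rises. Revenue = price * demand(price); search for the best price.
def demand(price, intercept=1000.0, slope=-8.0):
    return max(intercept + slope * price, 0.0)

def revenue(price):
    return price * demand(price)

# Grid search over candidate prices (a simple stand-in for a real
# dynamic-pricing optimizer).
prices = [p / 2 for p in range(0, 251)]  # 0.0 to 125.0 in steps of 0.5
best_price = max(prices, key=revenue)
print("best price:", best_price, "revenue:", revenue(best_price))
```

Real price optimization would refit the demand model per customer or product segment, which is where the segmentation step above comes in.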
Demand Forecasting

• Demand forecasting refers to the use of
analytics techniques to predict the demand for
a product and forecast sales. When you know
the sales trends in advance, it gives you an
advantage over your competitors in the
following ways:
• Pricing strategy
• Customer service
• Inventory management
• Cash flow management
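As a minimal illustration, a simple moving average is a common baseline before trying more sophisticated forecasting models; the monthly sales figures below are hypothetical:

```python
# Hypothetical monthly unit sales; forecast next month with a simple
# 3-month moving average (a baseline before trying fancier models).
sales = [120, 135, 128, 140, 152, 149, 160, 171]

def moving_average_forecast(series, window=3):
    # Average the most recent `window` observations.
    return sum(series[-window:]) / window

forecast = moving_average_forecast(sales)
print("next-month forecast:", forecast)
```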
Data Science

• A foundation in DS requires
– not just understanding statistical and computational techniques,
– but also recognizing how they apply to real scenarios.

• Whatever aspect of the world we wish to study (Earth’s weather, markets, polls):

– the data we collect typically offer an incomplete description of the subject at hand.
– A central challenge of data science is to make reliable conclusions using partial information.

• In this endeavor, we combine two essential tools: computation and randomization.

– Computing skills
• will allow us to use all available information to draw conclusions.
• Rather than focusing only on the average temperature of a region, we will consider the whole range of
temperatures together to construct a more nuanced analysis.

– Randomness
• allows us to consider the many different ways in which incomplete information might be completed.
• Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a
way to imagine many possible scenarios that are all consistent with the data we observe.
Statistical Techniques

• The discipline of statistics has long addressed the same
fundamental challenge as data science: how to draw robust
conclusions about the world using incomplete information.

• An important contribution of statistics is a consistent and precise
vocabulary for describing the relationship between
observations and conclusions.

• Following the same tradition, we focus on a set of core inferential
problems from statistics:
– testing hypotheses,
– estimating confidence, and
– predicting unknown quantities.
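Hypothesis testing can be illustrated with a permutation test, one standard randomization-based approach (the treatment and control outcomes below are made up):

```python
import random

# Hypothetical treatment/control outcomes; is the observed difference in
# means larger than chance alone would produce? (a permutation test sketch)
treatment = [8.2, 9.1, 7.8, 9.5, 8.9]
control = [7.1, 7.9, 6.8, 7.5, 8.0]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treatment) - mean(control)

# Repeatedly shuffle the pooled data into fake "treatment"/"control"
# groups and see how often a difference this large appears by chance.
random.seed(1)
pooled = treatment + control
count = 0
trials = 2000
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:5]) - mean(pooled[5:])
    if diff >= observed:
        count += 1
p_value = count / trials
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```

A small p-value says the observed gap is rarely reproduced by random relabeling, which is evidence against the hypothesis that the treatment has no effect.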
Data Science…Goes beyond Statistics
• Data science extends the field of statistics by taking full advantage of
– computing,
– data visualization,
– machine learning,
– optimization, and
– access to information.

• The combination of fast computers and the Internet gives anyone the ability
to access and analyze vast datasets:
– millions of news articles,
– full encyclopedias,
– databases for any domain, and
– massive repositories of music, photos, and video.

• Challenge - Real data often do not follow regular patterns/ match standard
equations.

• The interesting variation in real data can be lost by focusing too much
attention on simplistic summaries such as average values.
Contrast: Databases
Property       Databases                      Data Science
Data Value     “Precious”                     “Cheap”
Data Volume    Modest                         Massive
Examples       Bank records,                  Online clicks,
               personnel records,             GPS logs,
               census,                        tweets,
               medical records                building sensor readings
Priorities     Consistency,                   Speed,
               error recovery,                availability,
               auditability                   query richness
Structured     Strongly (schema)              Weakly or none (text)
Properties     Transactions, ACID*            CAP* theorem (2 of 3),
                                              eventual consistency
Realizations   SQL                            NoSQL: MongoDB, CouchDB,
                                              HBase, Cassandra, Riak,
                                              Memcached, Apache River, …

ACID = Atomicity, Consistency, Isolation and Durability
CAP = Consistency, Availability, Partition Tolerance
Contrast: Business Intelligence
Business Intelligence          Data Science

Querying the past              Querying the past,
                               present and future
Contrast: Machine Learning
Machine Learning                          Data Science
Develop new (individual) models           Explore many models; build and tune hybrids
Prove mathematical properties of models   Understand empirical properties of models
Improve/validate on a few relatively      Develop/use tools that can handle
clean, small datasets                     massive datasets
Publish a paper                           Take action!
Importance of Linear Algebra
 Linear algebra is a fundamental part of data science
 Here we identify the most important concepts from
linear algebra that are useful in data science
 Linear algebra can be treated very theoretically and
very formally; this short module, however, covers the
linear algebra that is relevant to data science
 In data science, data representation becomes an
important aspect, and data is usually represented in
matrix form
 The second thing one is interested in from a data
science perspective is, if the data contains several
variables of interest,
 how many of these variables are really important,
whether there are relationships between these
variables and, if there are such relationships, how
one can uncover them
 This is another interesting and important question
that needs to be answered from the viewpoint of
understanding data
 Linear algebraic tools allow us to understand this
 The third point is that ideas from linear algebra
become very important in all kinds of machine
learning algorithms
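A small sketch of these ideas: data stored as a matrix (rows as observations, columns as variables) and a correlation coefficient used to uncover a relationship between two variables. The height/weight figures are hypothetical:

```python
# A tiny dataset as a matrix: each row is an observation, each column a
# variable (hypothetical heights in cm and weights in kg).
data = [
    [160, 55],
    [165, 60],
    [170, 68],
    [175, 72],
    [180, 80],
]

def column(matrix, j):
    return [row[j] for row in matrix]

def pearson(xs, ys):
    """Pearson correlation: one simple way to quantify the linear
    relationship between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(column(data, 0), column(data, 1))
print(f"correlation between height and weight: {r:.3f}")
```

With many variables, the same idea generalizes to a correlation matrix, and matrix factorizations such as the SVD can then reveal which combinations of variables carry most of the information.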
Optimization for Data Science
 From a mathematical foundation viewpoint, it can be
said that the three pillars of data science are linear
algebra, statistics, and optimization, which is used in
pretty much all data science algorithms.
 To understand optimization concepts, one needs a
good fundamental understanding of linear algebra.
 Optimization is a problem where you maximize or
minimize a real function by systematically choosing
input values from an allowed set and computing the
value of the function.
 That means when we talk about optimization we are
always interested in finding the best solution. So, say
that one has some functional form:
 That means when we talk about optimization we are
always interested in finding the best solution. So, let
say that one has some functional form
 minimize f₀(x)
   subject to fᵢ(x) ≤ 0, i = 1, …, k
              hⱼ(x) ≤ 0, j = 1, …, l
 A basic understanding of optimization will help
in:
 More deeply understanding the working of machine learning
algorithms.
 Rationalizing the working of an algorithm. That means if you
get a result and you want to interpret it, a deep
understanding of optimization will let you see why you
got that result.
 And at an even higher level of understanding, you might be
able to develop new algorithms yourself.
 Generally, an optimization problem has three
components: minimize f(x), w.r.t. x, subject to a ≤ x ≤ b
 The objective function(f(x)): The first component is an
objective function f(x) which we are trying to either
maximize or minimize. In general, we talk about minimization
problems this is simply because if you have a maximization
problem with f(x) we can convert it to a minimization
problem with -f(x).
So, without loss of generality, we can look at minimization
problems.
 Decision variables(x): The second component is the
decision variables which we can choose to minimize
the function. So, we write this as min f(x).
 Constraints(a ≤ x ≤ b): The third component is the
constraint which basically constrains this x to some
set.
Types of Optimization Problems:
 Depending on the presence of constraints:
 Constrained optimization problems: In cases where the
constraint is given there and we have to have the solution
satisfy these constraints we call them constrained
optimization problems.
 Unconstrained optimization problems: In cases where
the constraint is missing we call them unconstrained
optimization problems.
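A minimal sketch of a constrained optimization problem solved numerically. Projected gradient descent (clamping each step back into [a, b]) is one simple approach, shown here for an assumed objective f(x) = (x - 3)^2:

```python
# Minimize f(x) = (x - 3)^2 subject to 0 <= x <= 2: a constrained
# optimization problem solved with projected gradient descent.
def f(x):
    return (x - 3) ** 2

def grad_f(x):
    return 2 * (x - 3)

a, b = 0.0, 2.0   # constraint: a <= x <= b
x = 0.5           # starting point inside the feasible set
lr = 0.1          # step size
for _ in range(200):
    x = x - lr * grad_f(x)      # gradient step toward the minimum
    x = min(max(x, a), b)       # project back onto the feasible set

print("constrained minimizer:", round(x, 4), "f(x) =", round(f(x), 4))
```

The unconstrained minimum is at x = 3, but the constraint pins the solution to the boundary x = 2; dropping the projection line turns this into the unconstrained version of the same problem.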
Structured Thinking
 Structured thinking is a framework for solving
unstructured problems — which covers almost all
data science problems.
 Using a structured approach to solve problems
doesn't only help with solving the problem faster but
also identifies the parts of the problem that may need
some extra attention.
 Think of structured thinking as the map of a new
city you’re visiting.
 Without a map, you will probably find it difficult to
reach your destination.
 Even if you did eventually reach your destination, it
would probably have taken you double the time you
would have needed if you had a map.
 Structured thinking is a framework and not a fixed
mindset; it can be modified to match any problem
you need to solve.
Structured Data
 Structured data, typically categorized as quantitative
data, is highly organized and easily decipherable by
machine learning algorithms.
 Developed by IBM in 1974, structured query
language (SQL) is the programming language used to
manage structured data.
 By using a relational (SQL) database, business users
can quickly input, search and manipulate structured
data.
 Examples of structured data include dates, names,
addresses, credit card numbers, etc. Their benefits are
tied to ease of use and access, while liabilities revolve
around data inflexibility.
Pros of structured data
 Easily used by machine learning (ML) algorithms:
The specific and organized architecture of structured
data eases manipulation and querying of ML data.
 Easily used by business users: Structured data does
not require an in-depth understanding of different types
of data and how they function. With a basic
understanding of the topic relative to the data, users can
easily access and interpret the data.
 Accessible by more tools: Since structured data
predates unstructured data, there are more tools
available for using and analyzing structured data.
Cons of structured data
 Limited usage: Data with a predefined structure
can only be used for its intended purpose, which
limits its flexibility and usability.
 Limited storage options: Structured data is generally
stored in data storage systems with rigid schemas (e.g.,
data warehouses). Therefore, changes in data
requirements necessitate an update of all structured
data, which leads to a massive expenditure of time and
resources.
Structured data tools
 OLAP: Performs high-speed, multidimensional
data analysis from unified, centralized data stores.
 SQLite: Implements a self-contained, serverless,
zero-configuration, transactional relational database
engine.
 MySQL: Embeds data into mass-deployed
software, particularly mission-critical, heavy-load
production systems.
 PostgreSQL: Supports SQL and JSON querying as
well as high-tier programming languages (C/C++, Java,
Python, etc.).
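Python's built-in sqlite3 module gives a quick way to see structured data in action: a fixed schema, rows and columns, and SQL queries (the booking records below are hypothetical):

```python
import sqlite3

# Structured data in action: rows and columns with a fixed schema,
# queried with SQL via Python's built-in SQLite engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (name TEXT, destination TEXT, price REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [("Asha", "Delhi", 120.0), ("Ravi", "Mumbai", 95.5), ("Meena", "Delhi", 110.0)],
)
# Aggregate bookings per destination, as an online-booking site might.
rows = conn.execute(
    "SELECT destination, COUNT(*), AVG(price) FROM bookings GROUP BY destination"
).fetchall()
for destination, count, avg_price in rows:
    print(destination, count, avg_price)
conn.close()
```

Because the schema is declared up front, the database can validate, index, and aggregate the data easily; that is exactly the "ease of use" benefit of structured data described above.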
Use cases for structured data
 Customer relationship management (CRM):
CRM software runs structured data through analytical
tools to create datasets that reveal customer behavior
patterns and trends.
 Online booking: Hotel and ticket reservation data (e.g.,
dates, prices, destinations, etc.) fits the “rows and
columns” format indicative of the pre-defined data
model.
 Accounting: Accounting firms or departments
use structured data to process and record
financial transactions.
Unstructured data
 Unstructured data, typically categorized as qualitative
data, cannot be processed and analyzed via
conventional data tools and methods.
 Since unstructured data does not have a predefined data
model, it is best managed in non-relational (NoSQL)
databases. Another way to manage unstructured data is
to use data lakes to preserve it in raw form.
Pros of unstructured data
 Native format: Unstructured data, stored in its
native format, remains undefined until needed. Its
adaptability increases file formats in the database,
which widens the data pool and enables data scientists
to prepare and analyze only the data they need.
 Fast accumulation rates: Since there is no need to
predefine the data, it can be collected quickly and
easily.
 Data lake storage: Allows for massive storage and
pay-as-you-use pricing, which cuts costs and eases
scalability.
Cons of unstructured data
 Requires expertise: Due to its undefined/non-formatted
nature, data science expertise is required to prepare
and analyze unstructured data. This is beneficial
to data analysts but alienates unspecialized business
users who may not fully understand specialized data
topics or how to utilize their data.
 Specialized tools: Specialized tools are required
to manipulate unstructured data, which limits
product choices for data managers.

 Courtesy of IBM
Unstructured data tools
 MongoDB: Uses flexible documents to process data
for cross-platform applications and services.
 DynamoDB: Delivers single-digit millisecond
performance at any scale via built-in security, in-
memory caching and backup and restore.
 Hadoop: Provides distributed processing of large data
sets using simple programming models and no
formatting requirements.
 Azure: Enables agile cloud computing for creating and
managing apps through Microsoft’s data centers.
Use cases for unstructured data

 Data mining: Enables businesses to use


unstructured data to identify consumer behavior,
product sentiment, and purchasing patterns to better
accommodate their customer base.
 Predictive data analytics: Alert businesses of
important activity ahead of time so they can properly
plan and accordingly adjust to significant market shifts.
 Chatbots: Perform text analysis to route
customer questions to the appropriate answer
sources.
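A toy sketch of the chatbot use case: keyword-based routing of free-form text to an answer source. The keyword sets and messages below are hypothetical; real chatbots use trained text classifiers, but the shape of the task is the same:

```python
# Unstructured text in action: a toy chatbot router that scans free-form
# customer messages for keywords and picks an answer source.
ROUTES = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "shipping": {"delivery", "shipping", "track", "package"},
    "support": {"broken", "error", "help", "crash"},
}

def route(message):
    words = set(message.lower().split())
    # Score each topic by how many of its keywords appear in the message.
    scores = {topic: len(words & kws) for topic, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(route("Where is my package"))
print(route("I need a refund for this charge"))
print(route("hello there"))
```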
Properties       Structured data             Semi-structured data         Unstructured data
Technology       Based on relational         Based on XML/RDF             Based on character
                 database tables             (Resource Description        and binary data
                                             Framework)
Transaction      Matured transaction and     Transaction is adapted       No transaction
management       various concurrency         from DBMS, not matured       management and no
                 techniques                                               concurrency
Version          Versioning over             Versioning over tuples       Versioned as a
management       tuples, rows, tables        or graphs is possible        whole
Flexibility      Schema-dependent and        More flexible than           More flexible; absence
                 less flexible               structured data but less     of schema
                                             flexible than
                                             unstructured data
Scalability      Very difficult to           Scaling is simpler           More scalable
                 scale DB schema             than structured data
Robustness       Very robust                 New technology, not
                                             very widespread
Query            Structured queries          Queries over anonymous       Only textual queries
performance      allow complex joins         nodes are possible           are possible
References
• The material is prepared by taking inputs from
various textbooks and other internet sources.
