Contact For The Course: - Instructor: Dr. Kauser Ahmed P
What is Data Science?
• From the ACM Task Force on Data Science Curricula: draws from many different disciplines
• Statistics
– a central component of data science. Why?
– statistics studies how to make robust conclusions with incomplete information.
• Computing
– is a central component because
– programming allows us to apply analysis techniques to the large and diverse data sets
that arise in real-world applications:
– not just numbers, but text, images, videos, and sensor readings.
• Domain Knowledge
– Through understanding a particular domain, data scientists learn to ask appropriate
questions about their data and correctly interpret the answers provided by their inferential
and computational tools.
Data science is all of these things, but it is more than the sum of its parts because of the
applications.
What is data science?
Data science enables businesses to process huge amounts
of structured and unstructured big data to detect patterns.
This in turn allows companies to increase efficiencies,
manage costs, identify new market opportunities, and
boost their market advantage.
Asking a personal assistant like Alexa or Siri for a
recommendation demands data science.
So does operating a self-driving car, using a search engine that
provides useful results, or
talking to a chatbot for customer service.
What is Foundations of Data Science?
• Drawing useful conclusions from data using computation
• Exploration
– Identifying patterns in information
– Uses visualizations
• Inference
– Quantifying whether those patterns are reliable
– Uses randomization
– Quantifying our degree of certainty: will the patterns we found also appear in new
observations? How accurate are our predictions?
• Prediction
– Making informed guesses
– Uses machine learning
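The three steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the `hours` and `scores` data are made up, a correlation coefficient stands in for a visualization, a shuffle (permutation) test stands in for "randomization", and a least-squares line stands in for machine learning.

```python
import random
import statistics

# Hypothetical data: hours studied and exam scores for ten students.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [52, 55, 61, 60, 68, 70, 75, 79, 84, 88]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Share of random shuffles whose correlation is at least as strong
    as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(correlation(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Exploration: is there a pattern between the two columns?
observed_r = correlation(hours, scores)

# Inference: could a pattern this strong arise from chance alone?
p_value = permutation_p_value(hours, scores)

# Prediction: an informed guess for a new student who studies 11 hours.
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 11 + intercept

print(f"correlation={observed_r:.3f}  p-value={p_value:.4f}  "
      f"predicted score={predicted_score:.1f}")
```

A small p-value here only says that shuffled data rarely shows a pattern this strong; it does not say the prediction for a new student will be accurate.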
Example data science applications
• Marketing: predict the characteristics of high lifetime value
(LTV) customers, which can be used to support customer
segmentation, identify upsell opportunities, and support
other marketing initiatives
• Logistics: forecast how many of which items you will need and
where you will need them, which enables lean inventory
and prevents out-of-stock situations
• Healthcare: analyze survival statistics for different patient
attributes (age, blood type, gender, etc.) and treatments;
predict risk of readmission based on patient attributes,
medical history, etc.
More Examples
• Transaction databases: recommender systems (Netflix), fraud detection
(security and privacy)
• Text data, social media data: product reviews and consumer satisfaction
(Facebook, Twitter, LinkedIn), e-discovery
• A foundation in DS requires
– Not just understanding statistical and computational techniques,
– but also recognizing how they apply to real scenarios.
– Computing skills
• will allow us to use all available information to draw conclusions.
• Rather than focusing only on the average temperature of a region, we will consider the whole range of
temperatures together to construct a more nuanced analysis.
– Randomness
• allows us to consider the many different ways in which incomplete information might be completed.
• Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a
way to imagine many possible scenarios that are all consistent with the data we observe.
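One common way to "imagine many possible scenarios" is bootstrap resampling: draw new samples, with replacement, from the observed data and see how much a summary varies across them. The sketch below is an illustration of that one technique, not necessarily the method the course uses; the `temperatures` values are hypothetical.

```python
import random
import statistics

# Hypothetical sample: ten daily high temperatures (deg C) for one region.
temperatures = [18.2, 21.5, 19.8, 25.1, 30.4, 17.6, 22.3, 28.9, 20.0, 23.7]


def bootstrap_means(sample, trials=5000, seed=0):
    """Resample the data with replacement many times and record the mean
    of each resample. Every resample is one scenario that is consistent
    with the observed data."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(trials)]


means = sorted(bootstrap_means(temperatures))

# Middle 95% of the resampled means: a rough range for the regional average.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"observed mean: {statistics.mean(temperatures):.2f}")
print(f"95% bootstrap interval: {lower:.2f} to {upper:.2f}")
```

The interval is built from the data alone, without assuming that temperatures follow any particular distribution.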
Statistical Techniques
• The combination of fast computers and the Internet gives anyone the ability
to access and analyze vast datasets:
– millions of news articles,
– full encyclopedias,
– databases for any domain, and
– massive repositories of music, photos, and video.
• Challenge: real data often do not follow regular patterns or match standard
equations.
• The interesting variation in real data can be lost by focusing too much
attention on simplistic summaries such as average values.
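A small sketch of how an average can hide variation, using two made-up sets of daily temperatures (the `coastal` and `desert` values are hypothetical). Both have the same mean, yet they describe very different climates.

```python
import statistics

# Hypothetical daily temperatures (deg C) for two regions.
coastal = [19, 20, 20, 21, 20, 19, 21]
desert = [5, 35, 8, 33, 20, 6, 33]


def summarize(values):
    """Return the mean together with measures of spread."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


coastal_summary = summarize(coastal)
desert_summary = summarize(desert)

for name, summary in (("coastal", coastal_summary), ("desert", desert_summary)):
    print(f"{name}: mean={summary['mean']}, "
          f"range={summary['min']} to {summary['max']}, "
          f"stdev={summary['stdev']:.2f}")
```

Looking only at the mean, the two regions are indistinguishable; the range and standard deviation show the difference.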
Contrast: Databases
              Databases                 Data Science
Data Value    “Precious”                “Cheap”
Data Volume   Modest                    Massive
Examples      Bank records,             Online clicks,
              Personnel records,        GPS logs,
              Census,                   Tweets,
              Medical records           Building sensor readings
Priorities    Consistency,              Speed,
              Error recovery,           Availability,
              Auditability              Query richness
Structured    Strongly (Schema)         Weakly or none (Text)
Properties    Transactions, ACID*       CAP* theorem (2/3),
                                        eventual consistency
Realizations  SQL                       NoSQL:
                                        MongoDB, CouchDB,
                                        HBase, Cassandra, Riak,
                                        Memcached,
                                        Apache River, …
Courtesy of IBM
Unstructured data tools
MongoDB: Uses flexible documents to process data
for cross-platform applications and services.
DynamoDB: Delivers single-digit millisecond
performance at any scale, with built-in security,
in-memory caching, and backup and restore.
Hadoop: Provides distributed processing of large data
sets using simple programming models and no
formatting requirements.
Azure: Enables agile cloud computing for creating and
managing apps through Microsoft’s data centers.
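To make "flexible documents" concrete, here is a minimal sketch using plain Python dictionaries and JSON. It does not use MongoDB or any database driver; the records, field names, and the `schema` tuple are all hypothetical. The point is only that each document can carry its own fields, whereas a strongly structured table expects every row to share the same columns.

```python
import json

# Hypothetical event records: each document has only the fields it needs.
documents = [
    {"user": "a01", "event": "click", "page": "/home"},
    {"user": "a02", "event": "gps", "lat": 12.97, "lon": 79.16},
    {"user": "a03", "event": "tweet", "text": "Enjoying the course",
     "tags": ["datascience"]},
]

# A strongly structured table would require exactly these columns per row.
schema = ("user", "event", "page")


def fits_schema(doc, columns):
    """True if the document has exactly the columns the schema demands."""
    return set(doc) == set(columns)


def fields_in_use(docs):
    """All field names that appear in at least one document."""
    return sorted({key for doc in docs for key in doc})


conforming = [doc["user"] for doc in documents if fits_schema(doc, schema)]
all_fields = fields_in_use(documents)

print("fits the fixed schema:", conforming)
print("fields across all documents:", all_fields)
print(json.dumps(documents[1]))  # documents serialize naturally to JSON
```

Only one of the three records fits the fixed schema, which is why document-oriented stores are often chosen for weakly structured data such as clicks, GPS logs, and tweets.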
Use cases for unstructured data