Fundamentals of Data Analytics Lecture 01. Probability: Instructional Team
- Probability
- Statistics
- Hands-on programming skills
- Meet your instructors & classmates
Instructional Team
? DA / ML / DS / BI / AI / 4th IR
? SQL / R…
⇒ Discuss with the instructional team & classmates (Piazza…)
Content of Lecture
➔ Counting Rules
➔ Sample Space, Event
➔ Independent Event
➔ Conditional Probability
➔ Bayes’ Theorem
Motivation Example
Whoever gets to 3 first wins the game and takes all the money.
INTRODUCTION
Probability & Statistics
Example
What is the average height of Vietnamese males?
1. Produce Data: Determine what to measure, then collect the data.
→ Select 1,000 adult males at random.
→ Measure and record their heights.
2. Explore the Data: Analyze and summarize the data.
→ In the sample, the average height is 165.7 cm.
3. Draw a Conclusion: Use the data, probability, and statistical inference
→ Draw a conclusion about the population.
COUNTING
Counting rules
Rule of counting
Event A can occur in n1 ways & Event B can occur in n2 ways
⟹ Events A and B can occur in n1 × n2 ways.
In general, the number of ways that m events can occur is n1 × n2 × . . . × nm.
Example:
How many unique stock-keeping unit (SKU) labels can a chain of hardware stores
create by using two letters (ranging from AA to ZZ) followed by four numbers
(digits 0 through 9)?
Solution:
26 x 26 x 10 x 10 x 10 x 10 = 6,760,000
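The multiplication above can be checked with a short Python sketch. The variable names are illustrative, and the one-letter, one-digit label format used for the brute-force check is a hypothetical smaller case, not part of the example.

```python
from itertools import product
from math import prod
from string import ascii_uppercase, digits

# Number of choices at each of the six label positions: two letters, four digits.
choices_per_position = [26, 26, 10, 10, 10, 10]
sku_count = prod(choices_per_position)
print(sku_count)  # 6760000

# Brute-force check of the rule on a smaller, hypothetical format
# (one letter followed by one digit): enumerate every label and count.
small_labels = ["".join(p) for p in product(ascii_uppercase, digits)]
print(len(small_labels) == 26 * 10)  # True
```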
Factorials
The number of unique ways that n items can be arranged in a particular order is n!
n! = n × (n-1) × (n-2) × … × 2 × 1
Example:
A home appliance service truck must make three stops (A, B, C). In how many ways
could the three stops be arranged?
Solution:
3! = 3 x 2 x 1 = 6
That is {ABC, ACB, BAC, BCA, CAB, CBA}
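As a quick check, the six orderings can be enumerated in Python. This is a minimal sketch using the standard library; the names `stops` and `routes` are illustrative.

```python
from itertools import permutations
from math import factorial

stops = ["A", "B", "C"]

# Every ordering of the three stops, joined into strings such as "ABC".
routes = ["".join(p) for p in permutations(stops)]
print(routes)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# The count of orderings matches n! for n = 3.
print(len(routes) == factorial(len(stops)))  # True
```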
Permutations
The number of possible permutations of n items taken r at a time, in a particular order, is
nPr = n! / (n − r)!
Example:
Five home appliance customers (A, B, C, D, E) need service calls, but the field technician
can service only three of them before noon. The order in which they are serviced
is important (to the customers, anyway), so each possible arrangement of three service
calls is different. The dispatcher must assign the sequence. How many possible
permutations are there?
Solution:
5P3 = 5! / (5 − 3)! = 120 / 2 = 60
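The permutation count for this example can be verified two ways in Python: by the factorial formula and by enumerating every ordered sequence of three customers. This is a sketch assuming Python 3.8 or later (for `math.perm`); the variable names are illustrative.

```python
from itertools import permutations
from math import factorial, perm

customers = "ABCDE"
n, r = len(customers), 3

# Formula: n! / (n - r)!
by_formula = factorial(n) // factorial(n - r)

# Enumeration: every ordered sequence of 3 distinct customers.
by_enumeration = len(list(permutations(customers, r)))

print(by_formula, perm(n, r), by_enumeration)  # 60 60 60
```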
Combinations
A combination is a collection of r items chosen at random without replacement
from n items where the order of the selected items is not important.
The number of possible combinations of r items chosen from n items is
nCr = n! / (r! (n − r)!)
Example:
Suppose that five customers (A, B, C, D, E) need service calls and the maintenance
worker can only service three of them this morning. The customers don’t care when
they are serviced as long as it’s before noon, so the dispatcher does not care who is
serviced first, second, or third. How many possible combinations?
Solution:
5C3 = 5! / (3! × 2!) = 120 / (6 × 2) = 10
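The combination count for this example can be checked by listing the unordered groups of three customers. This is a sketch assuming Python 3.8 or later (for `math.comb`); the last line illustrates why combinations are fewer than permutations, since each group of three can be ordered in 3! ways.

```python
from itertools import combinations
from math import comb, factorial

customers = "ABCDE"
n, r = len(customers), 3

# Every unordered group of 3 customers.
groups = ["".join(c) for c in combinations(customers, r)]
print(groups)
print(len(groups), comb(n, r))  # 10 10

# Each unordered group corresponds to r! ordered sequences.
print(comb(n, r) * factorial(r))  # 60
```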
PROBABILITY
Sample Spaces & Events
Definition
ω Outcome
Aᶜ Complement of A (not A)
∅ Null Event
Mutually Exclusive Events
MECE?
Probability
The probability of an event is a number that measures the relative likelihood that the
event will occur.
Axioms of Probability
Example:
A company interviewed 280 production workers before hiring 70 of them.
Let H = event that a randomly chosen interviewee is hired ⇒ P(H) = f/n = 70/280 = 0.25
Law of Large Numbers
As the number of trials increases, any empirical probability approaches its theoretical
limit.
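A small simulation can illustrate this behaviour. The sketch below assumes a fair coin (true probability 0.5) and a fixed random seed; both are illustrative choices, and `empirical_probability` is a hypothetical helper name.

```python
import random

def empirical_probability(n_trials, p_true=0.5, seed=0):
    """Relative frequency f/n of 'success' in n_trials simulated trials."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    successes = sum(rng.random() < p_true for _ in range(n_trials))
    return successes / n_trials

# The estimate f/n tends to settle near p_true as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, empirical_probability(n))
```

With few trials the relative frequency can be far from 0.5; with many trials it is typically very close.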
How Assigned? → Empirical Approach
CASE STUDY: Practical Issues for Actuaries
Actuaries help companies calculate payout rates on life insurance, pension plans, and health
care plans by estimating empirical probabilities.
Actuaries created the tables that guide IRA withdrawal rates for individuals from age 70 to
99. Here are a few challenges that actuaries face:
1. Is n “large enough” to say that f/n has become a good approximation to the probability
of the event of interest? (Data collection costs money, and decisions must be made)
2. Was the experiment repeated identically? (Subtle variations may exist in the
experimental conditions and data collection procedures)
3. Is the underlying process stable over time? (For example, default rates on 2007
student loans may not apply in 2017, due to changes in attitudes and interest rates)
How Assigned? → Classical Approach
Classical approach
In the classical approach, we do not actually have to perform an experiment because the
nature of the process allows us to envision the entire sample space.
→ We can use deduction to determine P(A).
A priori: the process of assigning probabilities before we actually observe the event or
try an experiment.
Example:
In the two-dice experiment, there are 36 equally likely outcomes.
H = rolling a seven ⇒ P(H) = 6/36 = 0.1667
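Because the whole sample space of the two-dice experiment can be listed, the classical probability of rolling a seven can be computed by enumeration. This is a minimal sketch assuming two fair six-sided dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (die 1, die 2): the full sample space.
sample_space = list(product(range(1, 7), repeat=2))

# Event H: the two faces sum to seven.
event_h = [(d1, d2) for d1, d2 in sample_space if d1 + d2 == 7]

p_h = Fraction(len(event_h), len(sample_space))
print(len(sample_space), len(event_h), p_h, round(float(p_h), 4))  # 36 6 1/6 0.1667
```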
How Assigned? → Subjective Approach
Subjective approach
A subjective probability reflects someone’s informed judgment about the likelihood of
an event when there is no repeatable random experiment.
Example:
● What is the probability that a new truck product program will show a return on
investment of at least 10 percent?
● What is the probability that the price of Ford’s stock will rise within the next 30
days?
Notes:
In such cases, we rely on personal judgment or expert opinion. However, such a
judgment is not random because it is typically based on experience with similar
events and knowledge of the underlying causal processes.
Interpretations of Probability
● Long-run frequency: P(A) is the long-run proportion of times that A is true in repetitions.
  Example: If we flip the coin many times, we expect it to land heads about half the time.
● Degree of belief: P(A) measures an observer’s strength of belief that A is true, or uncertainty about A.
  Example: The coin is equally likely to land heads or tails on the next toss.
Properties of Probability
⦿ P(∅) = 0
⦿ A ⊂ B ⇒ P(A) ≤ P(B)
⦿ 0 ≤ P(A) ≤ 1
⦿ P(Aᶜ) = 1 - P(A)
⦿ A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
⦿ P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
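These properties can be checked numerically on a small sample space. The sketch below assumes a single fair die with equally likely outcomes; the events A, B, and C are illustrative choices, not taken from the lecture.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # assumed sample space: one fair die

def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})  # even face
B = frozenset({4, 5, 6})  # face of at least 4
C = frozenset({1, 3})     # disjoint from A

assert prob(frozenset()) == 0                            # P(∅) = 0
assert {6} <= A and prob({6}) <= prob(A)                 # subset rule
assert 0 <= prob(A) <= 1                                 # bounds
assert prob(omega - A) == 1 - prob(A)                    # complement rule
assert not (A & C) and prob(A | C) == prob(A) + prob(C)  # disjoint union
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)    # general union

print(prob(A | B))  # 2/3
```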
Independent Events
Definition
Two events A and B are independent if
P(A ∩ B) = P(A) × P(B)
(equivalently, P(A | B) = P(A) when P(B) > 0).
Example: Tuition cost versus five-year net salary gains for MBA degree recipients at 67
top-tier graduate schools of business
Contingency Tables
Calculation From Contingency Tables
● Marginal Probability
● Joint probability
● Conditional Probability
● Independence
● Relative Frequencies
Marginal Probability
The marginal probability of an event is a relative frequency that is found by dividing a
row or column total by the total sample size.
The joint probability that the school has low tuition (T1) and has large salary gains (S3)
is P(T1 ∩ S3) = 1/67 = 0.0149
Conditional Probability
Conditional probabilities may be found by restricting ourselves to a single row or
column (the condition).
The conditional probability that salary gains are small (S1) given that the MBA tuition
is large (T3) is P(S1 | T3) = 5/32 = 0.1563
Independence
To check whether events in a contingency table are independent, we can look at
conditional probabilities.
Example: Is large salary gain (S3) independent of low tuition (T1) ?
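The calculations described for contingency tables can be sketched in Python. The table itself is not reproduced in these notes, so the cell counts below are ASSUMED for illustration: they were chosen only to agree with the figures quoted above (67 schools, P(T1 ∩ S3) = 1/67, P(S1 | T3) = 5/32), and the helper names are hypothetical. Any conclusion printed here holds only under those assumed counts.

```python
from fractions import Fraction

# ASSUMED counts (rows: tuition T1 low .. T3 large; columns: salary gain S1 small .. S3 large).
# Only the total of 67, the T1/S3 cell, and the T3 row's S1 cell and total are fixed by the notes.
counts = {
    "T1": {"S1": 5, "S2": 10, "S3": 1},
    "T2": {"S1": 7, "S2": 11, "S3": 1},
    "T3": {"S1": 5, "S2": 12, "S3": 15},
}
n = sum(sum(row.values()) for row in counts.values())

def marginal_row(t):
    """Marginal probability of a row: row total / sample size."""
    return Fraction(sum(counts[t].values()), n)

def marginal_col(s):
    """Marginal probability of a column: column total / sample size."""
    return Fraction(sum(row[s] for row in counts.values()), n)

def joint(t, s):
    """Joint probability: single cell / sample size."""
    return Fraction(counts[t][s], n)

def cond_col_given_row(s, t):
    """Conditional probability P(s | t): restrict attention to row t."""
    return Fraction(counts[t][s], sum(counts[t].values()))

print(n)                               # 67
print(joint("T1", "S3"))               # 1/67
print(cond_col_given_row("S1", "T3"))  # 5/32

# Independence check: compare P(S3 | T1) with the marginal P(S3).
print(cond_col_given_row("S3", "T1") == marginal_col("S3"))  # False for these assumed counts
```

Comparing a conditional probability with the corresponding marginal is the check described above: if the two differ, the events are not independent.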
Internet 46 25 15
License 20 20 20
Example:
Of the population age 16–21 and not in college:
● 13.50% are unemployed (U)
● 29.05% are high school dropouts (D)
● 5.32% are unemployed high school dropouts (U∩D).
→ The probability that a youth is unemployed given that the person dropped out:
P(U | D) = P(U∩D) / P(D) = 0.0532 / 0.2905 = 0.1831
The conditional probability of being unemployed (0.1831) is greater than the
unconditional probability of being unemployed (0.1350).
→ In other words, knowing that someone is a high school dropout alters the probability
that the person is unemployed.
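The conditional probability in this example follows from the definition P(U | D) = P(U∩D) / P(D). A minimal sketch using the three percentages given above (the variable names are illustrative):

```python
p_u = 0.1350        # P(U): unemployed
p_d = 0.2905        # P(D): high school dropout
p_u_and_d = 0.0532  # P(U ∩ D): unemployed dropout

# Definition of conditional probability.
p_u_given_d = p_u_and_d / p_d
print(round(p_u_given_d, 4))  # 0.1831

# Conditioning on D raises the probability of U, so U and D are not independent.
print(p_u_given_d > p_u)  # True

# Under independence, P(U ∩ D) would equal P(U) * P(D).
print(round(p_u * p_d, 4))  # 0.0392
```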
Bayes’s Theorem
Theorem
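Let A and B be events with P(A) > 0:
P(B | A) = P(A | B) P(B) / P(A)
General form
If event B is split into n mutually exclusive and collectively exhaustive
categories (B1, B2, ..., Bn), then for each i:
P(Bi | A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bn) P(Bn)]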
Bayes’ Theorem
Example: Rare Disease detection
A medical test for a rare disease D has outcomes (+) and (−).
Joint probabilities:
        D       Dᶜ
(+)   0.009   0.099
(−)   0.001   0.891
Suppose you go for a test and get a positive.
What is the probability you have the disease?
With:
P(+|D) = 0.009 / (0.009 + 0.001) = 0.9
P(D) = (0.009 + 0.001) / (0.009 + 0.001 + 0.099 + 0.891) = 0.01
P(+) = (0.009 + 0.099) / (0.009 + 0.099 + 0.001 + 0.891) = 0.108
→ P(D|+) = 0.9 x 0.01 / 0.108 = 0.083 = 8.3%
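The rare-disease calculation can be reproduced in Python from the four joint probabilities. The sketch below also includes a small helper, `bayes_posterior`, which is a hypothetical name for one way to apply the general form with a prior and a likelihood for each category.

```python
# Joint probabilities from the example: (test result, disease status).
joint = {
    ("+", "D"): 0.009, ("+", "not D"): 0.099,
    ("-", "D"): 0.001, ("-", "not D"): 0.891,
}

p_d = joint[("+", "D")] + joint[("-", "D")]        # P(D)  ≈ 0.01
p_pos = joint[("+", "D")] + joint[("+", "not D")]  # P(+)  ≈ 0.108
p_pos_given_d = joint[("+", "D")] / p_d            # P(+|D) ≈ 0.9

# Bayes' theorem: P(D|+) = P(+|D) P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.083

def bayes_posterior(priors, likelihoods):
    """General form: priors P(Bi) and likelihoods P(A|Bi) -> posteriors P(Bi|A)."""
    total = sum(priors[b] * likelihoods[b] for b in priors)  # P(A), total probability
    return {b: priors[b] * likelihoods[b] / total for b in priors}

posteriors = bayes_posterior(
    priors={"D": p_d, "not D": 1 - p_d},
    likelihoods={"D": p_pos_given_d, "not D": joint[("+", "not D")] / (1 - p_d)},
)
print(round(posteriors["D"], 3))  # 0.083
```

Even with a positive result, the posterior stays small because the disease is rare: most positives come from the much larger healthy group.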
Bayes’ Theorem
Example: Email Filter
A: The email contains the word “free” B1 B2 B3
❏ To practice Python programming with a real dataset (will be used in office hours)
Dataset:
All Columns:
Id, Belongs_to_collection, Budget, Genres, Homepage, Imdb_id, Original_language, Original_title, Overview, Popularity,
Poster_path, Production_companies, Production_countries, Release_date, Runtime, Spoken_languages, Status, Tagline,
Title, Keywords, Cast, Crew
Frequently used Columns:
Budget, Genres, Original_language, Popularity, Production_companies, Production_countries, Release_date, Runtime,
Spoken_languages, Tagline, Title, Keywords, Cast, Crew
Programming
● A Crash Course in Python: https://nbviewer.jupyter.org/gist/rpmuller/5920182
● Programming tutorial:
https://colab.research.google.com/drive/1IOysoRfcxFyGJnjVKY8pPTCJ0gG7EiZD
● Python tutorial: https://github.com/jerry-git/learn-python3
End of Lecture 01
● What you have learned
○ Counting Rules
○ Sample Space, Event
○ Independent Event
○ Conditional Probability
○ Bayes’ Theorem
● Questions?
Exercises for discussion
● Ignoring leap years, and assuming birthdays are equally likely to be any day of the
year, what is the chance of a tie in birthdays among the students in this class?
● In any 15-minute interval, there is a 20% probability that you will see at least one
shooting star. What is the probability that you see at least one shooting star in the
period of an hour?
● A certain couple tells you that they have two children, at least one of which is a girl.
What is the probability that they have two girls?
● How can you generate a random number between 1 and 7 using only a die?