Unit 2 1
Visualization (303105314)
Unit 2 : Introduction to Python Fundamentals
and Statistics
Arun Chauhan
Assistant Professor
Computer Science & Engineering
Outline
• Introduction to Python
• Importance of Python
• Python Libraries for stats
• Introduction to stats
• Central tendency and Dispersion
• Types of variable
• Levels of Data measurement
• Sampling and Sampling Distribution
• Distribution of Sample Means, Population and Variance
• Confidence interval estimation
Introduction to Python
Python:
• …is a general-purpose, interpreted programming language.
• …is a language that supports multiple approaches to software
design, principally structured and object-oriented programming.
• …provides automatic memory management and garbage
collection.
• …is extensible.
• …is dynamically typed.
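As a small illustration of dynamic typing, the sketch below rebinds one name to values of two different types. The variable names are made up for this example.

```python
# A name is not tied to one type: it can be rebound at run time.
value = 42
first_type = type(value).__name__   # type of the integer binding

value = "forty-two"
second_type = type(value).__name__  # type of the string binding

print(first_type, second_type)
```

The type belongs to the object, not to the name, which is why no declaration is needed before reassigning.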
Importance of Python
• Rich Ecosystem of Libraries: Python boasts powerful libraries such
as Matplotlib, Seaborn, Plotly, and Bokeh, which offer extensive
functionality for creating various types of visualizations. These
libraries provide high-quality plots with customizable features.
Pandas Series
Descriptive stats:
• Descriptive statistics involve methods for summarizing and
describing the features of a dataset.
Contd.
• Descriptive statistics include measures of central tendency
(mean, median, mode), measures of dispersion (range, variance,
standard deviation), and measures of shape (skewness, kurtosis).
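The shape measures named above can be sketched with the standard library. This is a minimal sketch that assumes the population (uncorrected) moment formulas; statistical libraries may apply bias corrections and report slightly different values. The two data lists are hypothetical.

```python
import statistics


def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])


def kurtosis(values):
    """Population kurtosis: mean of standardized deviations to the 4th power."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 4 for x in values])


symmetric = [1, 2, 3, 4, 5]              # hypothetical, symmetric about 3
right_skewed = [1, 2, 2, 3, 3, 3, 20]    # hypothetical, long right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric))
```

A skewness near zero suggests a symmetric distribution, and a positive value suggests a longer right tail.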
Inferential statistics:
• Inferential statistics is a branch of statistics that involves making
inferences or predictions about a population based on a sample
of data taken from that population. It's used to draw conclusions,
make predictions, or test hypotheses about a larger group
(population) based on the characteristics observed in a smaller
sample from that group.
Central tendency and Dispersion
Mean = (sum of all values) / (number of values): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
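The mean is the sum of the values divided by how many values there are. A minimal sketch, using a hypothetical list of four numbers and cross-checking against `statistics.fmean`:

```python
import statistics

data = [10, 20, 30, 40]  # hypothetical values for illustration

# Mean = sum of the values / number of values
manual_mean = sum(data) / len(data)

# The standard library gives the same result.
library_mean = statistics.fmean(data)

print(manual_mean, library_mean)
```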
Contd.
Median: The median is the middle value in a sorted list of numbers.
If there is an even number of values, the median is the average of
the two middle numbers.
Median of 2, 4, 7 = 4
Contd.
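With an even number of values, the median is the mean of the two middle numbers; with an odd number, it is the single middle value.
The sorted list 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60 has 11 values, so its median is the 6th value, 57.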
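The odd and even cases of the median rule can be sketched as below. The function name `median` and the four-value list are illustrative; the other two lists are the ones from the slides (3 values and 11 values, both odd counts).

```python
import statistics


def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]  # 11 values (odd count)
small = [2, 4, 7]                                       # 3 values (odd count)
even_case = [2, 4, 7, 9]                                # hypothetical even-count list

print(median(scores), median(small), median(even_case))
```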
Measures of dispersion:
• Range
• Variance
• Standard Deviation
Contd.
Range: The range is the difference between the largest and smallest
values in a data set.
Consider the following set of numbers representing the ages of a
group of people:
{18, 22, 25, 30, 35, 40, 45}
Range = 45 − 18 = 27
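The three dispersion measures can be computed for the ages above as follows. The slides do not say whether the ages are a whole population or a sample, so this sketch shows both variance conventions (dividing by n and by n − 1) as an assumption.

```python
import math
import statistics

ages = [18, 22, 25, 30, 35, 40, 45]

# Range: largest value minus smallest value (45 - 18).
data_range = max(ages) - min(ages)

# Variance: average squared deviation from the mean.
mean_age = sum(ages) / len(ages)
squared_deviations = [(x - mean_age) ** 2 for x in ages]
population_variance = sum(squared_deviations) / len(ages)    # divide by n
sample_variance = sum(squared_deviations) / (len(ages) - 1)  # divide by n - 1

# Standard deviation: square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(data_range, population_variance, sample_variance, population_sd, sample_sd)
```

The standard library functions `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev` return the same quantities directly.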
Levels of Data measurement
Nominal:
• A nominal scale classifies data into distinct categories in which no
ranking is implied.
• Example: Gender, Marital Status
• Suppose you are conducting a questionnaire and record gender as
male = 0 and female = 1. The codes 0 and 1 are only labels for the
categories, so arithmetic on them (for example, an average) is not
meaningful.
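For nominal data, counting categories is the operation that makes sense. A minimal sketch, assuming the 0/1 coding described above and a hypothetical list of five responses:

```python
from collections import Counter

# The codes are labels only; the mapping follows the slide (male = 0, female = 1).
GENDER_LABELS = {0: "male", 1: "female"}

responses = [0, 1, 1, 0, 1]  # hypothetical questionnaire answers

# Frequencies and the mode are meaningful summaries of nominal data.
counts = Counter(GENDER_LABELS[code] for code in responses)
most_common_label, most_common_count = counts.most_common(1)[0]

print(counts, most_common_label)
```

Averaging the codes would produce a number, but that number would not correspond to any category.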
Contd.
Ordinal
• An ordinal scale classifies data into distinct categories in which
ranking is implied.
• Examples: Product satisfaction (Satisfied, Neutral, Unsatisfied);
Faculty rank (Professor, Associate Prof, Assistant Prof)
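Because ordinal categories have a ranking, they can be sorted and a middle category can be picked, even though distances between categories are undefined. A minimal sketch using the satisfaction example, with a hypothetical list of responses:

```python
# Order of the categories from lowest to highest.
SATISFACTION_ORDER = ["Unsatisfied", "Neutral", "Satisfied"]
rank = {label: position for position, label in enumerate(SATISFACTION_ORDER)}

responses = ["Satisfied", "Neutral", "Unsatisfied", "Satisfied", "Neutral"]  # hypothetical

# Sorting by rank is meaningful for ordinal data.
ordered = sorted(responses, key=rank.__getitem__)

# With an odd number of responses, the median is the middle category.
median_response = ordered[len(ordered) // 2]

print(ordered, median_response)
```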
Interval
• An interval scale is an ordered scale in which the difference
between measurements is a meaningful quantity, but the
measurements do not have a true zero point.
Contd.
• For example, in the case of temperature, a reading of 0°C or 0°F
doesn't mean there is no temperature; each zero is an arbitrary
reference point (0°C was chosen as the freezing point of water,
which is 32°F). Without a true zero point, ratios computed on the
scale are not meaningful. You can't say that 20°C is "twice as
hot" as 10°C because the zero point is arbitrary and doesn't
represent the complete absence of temperature.
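The "twice as hot" point can be checked numerically by converting both readings to Kelvin, which has a true zero. In this sketch the helper name `celsius_to_kelvin` is illustrative.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (a scale with a true zero)."""
    return celsius + 273.15


# The ratio on the Celsius scale suggests "twice as hot"...
ratio_celsius = 20 / 10

# ...but the same two temperatures in Kelvin differ by only a few percent.
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Differences, by contrast, agree on both scales.
difference_celsius = 20 - 10
difference_kelvin = celsius_to_kelvin(20) - celsius_to_kelvin(10)

print(ratio_celsius, ratio_kelvin, difference_celsius, difference_kelvin)
```

The difference is preserved by the conversion while the ratio is not, which is the behavior expected of an interval scale.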
Contd.
Ratio
• A ratio scale is an ordered scale in which the difference
between measurements is a meaningful quantity and the
measurements have a true zero point.
• Weight, age, salary and Kelvin temperature come under the
ratio scale. For example, 0 Kelvin means the absence of heat.
• So on the ratio scale, you can perform all kinds of arithmetic operations.
Sampling and Sampling Distributions
Population
• A population refers to the entire group of individuals, items, or
events about which you want to draw conclusions or make
inferences.
Sample
• A sample is a subset of the population selected for observation or
data collection.
• It represents a smaller, manageable portion of the population
from which data can be collected and analyzed.
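The population/sample relationship can be sketched as below. The slides do not name a selection method at this point, so simple random sampling without replacement is an assumption, and the population of 100 numbered units is hypothetical.

```python
import random
import statistics

# Hypothetical population: 100 units identified by the numbers 1..100.
population = list(range(1, 101))

# Fixed seed so the draw is reproducible.
rng = random.Random(42)

# Simple random sample of 10 units, drawn without replacement.
sample = rng.sample(population, 10)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(sample, population_mean, sample_mean)
```

The sample mean is an estimate of the population mean and will generally differ from it.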
Contd.
Non-probability Sampling
• In non-probability sampling, the probability of any particular
member being selected is unknown or cannot be calculated.
• Convenience Sampling: Samples are selected based on their
convenience or accessibility to the researcher.
• Purposive Sampling: Samples are selected based on the
researcher's judgment or specific criteria, often chosen to fulfill a
particular objective or purpose.
Contd.
• Quota Sampling: The population is divided into subgroups, and a
specific number of individuals are sampled from each subgroup based
on predetermined quotas.
• Snowball Sampling: Initial participants are selected, and then
additional participants are recruited based on referrals from those
initial participants.
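Quota and snowball sampling can be sketched as simple selection procedures. Everything below is hypothetical: the people, their groups, the quotas, and the referral lists are invented for illustration, and taking the first available people in each group is only one possible way to fill a quota.

```python
from collections import deque

# --- Quota sampling: fill a fixed quota for each subgroup ---
people = [
    ("P1", "undergraduate"), ("P2", "postgraduate"), ("P3", "undergraduate"),
    ("P4", "undergraduate"), ("P5", "postgraduate"),
]
quotas = {"undergraduate": 2, "postgraduate": 1}

quota_sample = {group: [] for group in quotas}
for name, group in people:
    # Accept a person only while their subgroup's quota is not yet filled.
    if group in quota_sample and len(quota_sample[group]) < quotas[group]:
        quota_sample[group].append(name)


# --- Snowball sampling: recruit through referrals from earlier participants ---
referrals = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}


def snowball_sample(seeds, referrals, target_size):
    """Recruit seeds first, then the people they refer, until target_size is reached."""
    recruited, seen = [], set()
    queue = deque(seeds)
    while queue and len(recruited) < target_size:
        person = queue.popleft()
        if person in seen:
            continue
        seen.add(person)
        recruited.append(person)
        queue.extend(referrals.get(person, []))
    return recruited


snowball = snowball_sample(["A"], referrals, target_size=4)
print(quota_sample, snowball)
```

In neither procedure can the selection probability of a given person be calculated, which is what makes them non-probability methods.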
Contd.
Sampling Distribution
• A sampling distribution is the distribution of all possible values
of a statistic for samples of a given size selected from the
population.
• The sampling distribution depends on multiple factors: the
statistic, sample size, sampling process, and the overall
population. It is used to judge how much statistics such as means,
ranges, variances, and standard deviations vary from sample to sample.
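A sampling distribution can be approximated by simulation: draw many samples of the same size and record the statistic each time. This is a rough sketch under stated assumptions: the population is hypothetical (10,000 uniform random values), the statistic is the mean, and the sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical population of 10,000 values.
population = [rng.uniform(0, 100) for _ in range(10_000)]

sample_size = 30
num_samples = 2_000

# One sample mean per repetition approximates the sampling distribution of the mean.
sample_means = [
    statistics.fmean(rng.sample(population, sample_size))
    for _ in range(num_samples)
]

population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

# Theory: the standard error of the mean is roughly sigma / sqrt(n).
expected_standard_error = population_sd / sample_size ** 0.5

print(population_mean, mean_of_means)
print(expected_standard_error, sd_of_means)
```

The mean of the sample means should land close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size.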
Distribution of Sample Mean, Population and
Variance
Types of Sampling Distributions
Contd.
1. Sampling distribution of sample mean
• Confidence level = 1 − α
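A confidence interval for the mean at level 1 − α can be sketched as below. This is a minimal sketch under loud assumptions: the population standard deviation is treated as known and the sample mean as normally distributed (a z-interval). The function name, the value 2.0 for the standard deviation, and the reuse of the 11 values from the median example as a sample are all illustrative choices, not part of the slides.

```python
from statistics import NormalDist, fmean


def mean_confidence_interval(sample, population_sd, alpha=0.05):
    """z-interval for the mean at confidence level 1 - alpha (population SD assumed known)."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-sided interval
    margin = z * population_sd / n ** 0.5    # z * standard error of the mean
    center = fmean(sample)
    return center - margin, center + margin


scores = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
ASSUMED_POPULATION_SD = 2.0  # hypothetical value, chosen only for illustration

low_95, high_95 = mean_confidence_interval(scores, ASSUMED_POPULATION_SD, alpha=0.05)
low_99, high_99 = mean_confidence_interval(scores, ASSUMED_POPULATION_SD, alpha=0.01)

print((low_95, high_95), (low_99, high_99))
```

A smaller α means a higher confidence level and therefore a wider interval. When the population standard deviation is unknown and the sample is small, a t-based interval is the usual alternative.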