Statistics
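The code cells throughout these notebooks assume the usual libraries have already been imported. The original import cell is not shown in this export, so the following setup is an assumption:

import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats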
In [2]: #mean(Average)
age=[12,32,45,67,80,75,35,16,18,200]
In [3]: np.mean(age)
Out[3]: 58.0
In [4]: weights=[56,57,75,46,56,78,80,85]
In [5]: np.mean(weights)
Out[5]: 66.625
In [6]: df=sns.load_dataset('tips')
In [7]: np.mean(df['total_bill'])
Out[7]: 19.78594262295082
In [8]: #median
np.median(age)
Out[8]: 40.0
In [10]: stats.mode(age)
In [11]: np.median(df['total_bill'])
Out[11]: 17.795
Measurement of Dispersion
In [1]: ages_lst=[24,12,34,55,67,86,54,64,21,9,75,50]
In [4]: np.mean(ages_lst)
Out[4]: 45.916666666666664
In [5]: np.var(ages_lst)
Out[5]: 595.4097222222222
In [6]: np.std(ages_lst)
Out[6]: 24.401018876723615
In [7]: sns.histplot(ages_lst,kde=True)
In [9]: Data=[[18,25,45],[67,87,43],[23,90,65]]
df=pd.DataFrame(Data,columns=['A','B','c'])
In [10]: df.mean()
Out[10]: A 36.000000
B 67.333333
c 51.000000
dtype: float64
In [11]: df.median()
Out[11]: A 23.0
B 87.0
c 45.0
dtype: float64
In [12]: df.std()
Out[12]: A 26.962938
B 36.692415
c 12.165525
dtype: float64
In [13]: df.var()
Out[13]: A 727.000000
B 1346.333333
c 148.000000
dtype: float64
In [14]: df.var(axis=1)
Out[14]: 0 196.333333
1 485.333333
2 1146.333333
dtype: float64
In [15]: df.std(axis=1)
Out[15]: 0 14.011900
1 22.030282
2 33.857545
dtype: float64
THE END
In [3]: f_test
Out[3]: 3.874302158273381
In [7]: critical_value
Out[7]: 3.6766746989395105
In [8]: # If f_test is greater than the critical value we reject the null hypothesis, else we fail to reject it
if f_test > critical_value:
    print("Reject the Null hypothesis")
else:
    print("Fail to reject the Null hypothesis")
The End
In [2]: df=sns.load_dataset('healthexp')
In [3]: df.head()
In [6]: df.cov()
In [7]: #correlation-Pearson
df.corr(method='pearson')
In [9]: df1=sns.load_dataset('flights')
In [10]: df1.cov()
eg: IQ of Students=[70,100,80,90,60]
Descriptive Statistics:
It consists of organizing and summarizing data.
Inferential Statistics
It consists of using the data you have measured to draw
conclusions.
Types of Data:
Discrete Data:
-->Whole numbers
-->Specific Range
Eg: No. of students in classroom, No. of Family Members, No. of
Vehicles on road,etc.
Continuous Data:
-->Any Value
Eg: Heights, Weight, Temperature, Volume, Speed, etc.
Nominal Data:
--> No Ranks
Eg: Gender, Blood Group, Pincode, Favourite Colour,etc.
Ordinal Data:
--> Ranks
Eg: Marks of Students, Feedbacks,etc.
(iii) Height data of a class: [178.9, 179, 179.5, 176, 177.2, 178.3,
175.8,...] : CONTINUOUS DATA.
If the researcher mistakenly treats the ordinal data as interval data and performs
mathematical operations, such as taking the average of the happiness ratings, it would be
misleading. While averaging the ratings might provide a single value, it would not
accurately represent the participants' true level of happiness since the intervals between
the categories are not equal. Instead, the researcher should use non-parametric statistical
tests suitable for ordinal data, such as the Mann-Whitney U test or the Kruskal-Wallis test,
which consider the ranking order of the categories.
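As a rough sketch of the non-parametric approach described above (the happiness ratings below are made up for illustration), SciPy's Mann-Whitney U test works directly on the ordinal values:

from scipy import stats
# hypothetical ordinal happiness ratings (1 = very unhappy ... 5 = very happy) for two groups
group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 3, 1, 2, 3, 2, 4]
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)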
Nominal Data:
Ordinal Data:
In summary, the main distinction between nominal and ordinal data lies in the nature of
the relationship between the categories. Nominal data consists of categories without any
order or ranking, while ordinal data has a natural order among the categories, although
the intervals between them may not be equal. Understanding these differences is crucial
when choosing appropriate statistical analyses and interpreting the results accurately.
A box plot provides a visual representation of the minimum, maximum, median, and
quartiles of a dataset.
Box:
The box represents the interquartile range (IQR), which spans the
middle 50% of the data. The bottom of the box corresponds to the
first quartile (Q1), and the top of the box corresponds to the
third quartile (Q3). The median is usually represented as a line
within the box.
Whiskers:
The whiskers extend from the box and represent the range of the
data, excluding outliers. The whiskers can be calculated in
different ways, such as extending to the minimum and maximum
values within a certain range or considering certain percentile
thresholds.
Outliers:
Individual data points that fall outside the whiskers are
considered outliers and are typically represented as individual
points or asterisks on the plot.
By using a box plot, you can quickly visualize the spread of the data, including the
minimum and maximum values, as well as the distribution across quartiles. It provides a
concise summary of the range and helps identify potential outliers or extreme values.
Box plots are especially useful when comparing multiple datasets or groups, as they allow
for easy visual comparison of the ranges and distributions across different categories or
variables.
Note that other types of plots, such as range plots or error bars, can also display data in
terms of range to some extent, but they may not provide as comprehensive information
about the quartiles and distribution as a box plot does.
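As a quick sketch (assuming seaborn and matplotlib are available), a box plot of the tips dataset used earlier can be drawn in a couple of lines:

import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('tips')
sns.boxplot(x=df['total_bill'])   # box = Q1 to Q3 with the median line; points beyond the whiskers are outliers
plt.show()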
Mean:
The mean, or average, is the sum of all values in a dataset
divided by the number of observations. It represents the central
value around which the data points tend to cluster. The mean is
sensitive to extreme values and can be influenced by outliers.
Median:
The median is the middle value in a dataset when the observations
are arranged in ascending or descending order. It is less
affected by extreme values compared to the mean. The median
represents the central value that divides the dataset into two
equal halves.
Mode:
The mode represents the most frequently occurring value or values
in a dataset. It is useful for identifying the most common
observation or category in categorical data.
Measures of Variability:
Standard Deviation:
The standard deviation measures the average amount of deviation
or dispersion of data points from the mean. It quantifies the
spread of the dataset by considering the differences between each
value and the mean, taking into account the variability of the
entire dataset.
Variance:
THE END
M = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6]
mode of M = 3 (the most frequently recurring element)
To determine the appropriate measure of central tendency for a given dataset, consider the
type of data, the presence of outliers, and the shape of the distribution. For normally
distributed numerical data without outliers, the mean is often a good choice. When dealing
with skewed data or potential outliers, the median may provide a more representative
value. The mode is most applicable for categorical or nominal data, or when identifying the
most common value in a dataset.
It's important to note that while these measures provide information about central
tendency, they do not capture the entire picture of the dataset's distribution.
In [2]:
import numpy as np
In [3]:
np.mean(data)
Out[3]: 177.01875
In [4]:
np.median(data)
Out[4]: 177.0
In [5]:
from scipy import stats
In [6]:
stats.mode(data)
In [8]:
np.std(data)
Out[8]: 1.7885814036548633
Variance
Variance measures the average squared deviation of each data point from the mean. It
quantifies the spread of the dataset by considering the differences between each value and
the mean. A higher variance indicates a greater dispersion.
To calculate the variance, you subtract the mean from each value, square the differences,
sum them up, and divide by the total number of values. For example, consider the
following dataset of exam scores: 65, 70, 75, 80, 85.
The mean is 75. The squared differences from the mean are: (65-75)^2, (70-75)^2, (75-
75)^2, (80-75)^2, (85-75)^2. Adding them up and dividing by 5 (the number of values)
gives you the variance. The variance provides a more comprehensive measure of
dispersion than the range but is influenced by the units of the data (since it involves
squaring the differences).
In [9]:
exam_scores=[65, 70, 75, 80, 85]
np.mean(exam_scores)
Out[9]: 75.0
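Continuing the worked example, the squared differences are 100, 25, 0, 25 and 100; their sum is 250, and 250 / 5 = 50, which NumPy confirms:

np.var(exam_scores)   # (100 + 25 + 0 + 25 + 100) / 5 = 50.0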
Standard Deviation
The standard deviation is the square root of the variance and is often used as a more
intuitive measure of dispersion. It measures the average amount by which the data points
deviate from the mean.
The standard deviation is calculated by taking the square root of the variance. Using the
same example of exam scores, once you have the variance, you can calculate the standard
deviation by taking the square root of the variance.
The standard deviation provides a more interpretable measure of spread since it is in the
same units as the original data. It is widely used in statistics and helps assess the variability
and consistency of the dataset.
In [10]:
np.std(exam_scores)
Out[10]: 7.0710678118654755
Range
The range is the simplest measure of dispersion and represents the difference between the
largest and smallest values in a dataset. It gives an idea of the total spread of the data. For
example, if you have a dataset of exam scores: 65, 70, 75, 80, 85, the range would be 85 -
65 = 20. The range is easy to calculate but can be influenced by outliers and may not
provide a complete understanding of the distribution.
In [11]:
data_range = max(exam_scores) - min(exam_scores)
data_range
Out[11]: 20
In a Venn diagram, each circle or shape represents a set, and the overlapping areas show
the elements that are shared between the sets. The non-overlapping areas represent the
unique elements of each set. The size of each circle does not indicate the size of the set; it
is used purely for visualization purposes.
(i) A intersection B
(ii) A ⋃ B
In [12]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}
In [13]:
A & B
Out[13]: {2, 6}
In [14]:
A|B
Positive Skewness (Right Skewness): In a positively skewed distribution, the tail on the right
side of the distribution is longer or more pronounced than the left tail. This means that the
majority of the data is concentrated on the left side of the distribution, while a few extreme
values are present on the right side.
Negative Skewness (Left Skewness): In a negatively skewed distribution, the tail on the left
side of the distribution is longer or more pronounced than the right tail. This indicates that
the majority of the data is concentrated on the right side of the distribution, while a few
extreme values exist on the left side.
Zero Skewness: A distribution with zero skewness is perfectly symmetric, where the left and
right tails are equally balanced.
In [15]:
snippet=[1, 2, 3, 4, 5,6, 7, 8, 9, 10, 100]
In [16]:
ls=[1,2,3,4,5,6,7,8,9,10]
CHECKING MEAN
In [17]:
import numpy as np
np.mean(snippet)
Out[17]: 14.090909090909092
In [18]:
np.mean(ls)
Out[18]: 5.5
CHECKING MEDIAN
In [19]:
np.median(snippet)
Out[19]: 6.0
In [20]:
np.median(ls)
Out[20]: 5.5
the mean is pulled to the right by the outlier of 100. The median, on the other hand, is not
affected by the outlier, so it is closer to the center of the distribution.
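To put a number on this, scipy.stats.skew can be applied to the two lists defined above (a quick check, assuming SciPy is available):

from scipy.stats import skew
skew(snippet)   # strongly positive: the outlier 100 creates a long right tail
skew(ls)        # 0.0: the values 1 to 10 are perfectly symmetric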
Covariance measures the extent to which two variables vary together. It can be positive,
negative, or zero. A positive covariance indicates that the variables tend to move in the
same direction, while a negative covariance indicates that they tend to move in opposite
directions. A zero covariance indicates that there is no linear relationship between the
variables.
Here are some examples of how covariance and correlation are used in statistical analysis:
Regression analysis is a statistical method that uses one or more independent variables to
predict a dependent variable. Covariance is often used in regression analysis to measure
the strength of the relationship between the independent and dependent variables.
Exploratory data analysis is a statistical method that is used to explore the data and to
identify patterns and relationships. Correlation can be used in exploratory data analysis to
identify relationships between variables.
Data: x1 = 1, x2 = 2, x3 = 3, x4 = 4
x̄ = (1 + 2 + 3 + 4) / 4 = 2.5
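A sketch of the same idea with NumPy (the x values come from the data above; the y values are made up, since the second variable is not shown in this export):

import numpy as np
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
print(np.cov(x, y))       # off-diagonal entries are the covariance of x and y
print(np.corrcoef(x, y))  # off-diagonal entries are the Pearson correlation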
Mean : The mean is the average of all the data points. Outliers can cause the mean to
be pulled in the direction of the outlier. For example, if we have a dataset of heights
with an average of 5'8" and one outlier of 7'2", the mean will be 6'0".
Median : The median is the middle value of the data points, when they are arranged
in increasing or decreasing order. Outliers do not affect the median. For example, if we
have a dataset of heights with a median of 5'8" and one outlier of 7'2", the median will
still be 5'8".
Mode : The mode is the value that appears most often in the data set. Outliers can
affect the mode, but not always. For example, if we have a dataset of heights with a
mode of 5'8" and one outlier of 7'2", the mode may still be 5'8". However, if the outlier
is very different from the rest of the data, it may be the mode instead.
Range : The range is the difference between the largest and smallest values in the
data set. Outliers can increase the range. For example, if we have a dataset of heights
with a range of 6", and one outlier of 7'2", the range will be 72".
Standard deviation : The standard deviation is a measure of how spread out the data
is. Outliers can increase the standard deviation. For example, if we have a dataset of
heights with a standard deviation of 2", and one outlier of 7'2", the standard deviation
will be 3".
In general, outliers can make it more difficult to interpret data. It is important to identify
outliers and to decide whether or not to remove them from the data set.
THE END.
2. Bernoulli Distribution (pmf)
3. Binomial Distribution (pmf)
6. Uniform Distribution (pmf)
In [2]:
import numpy as np
def normal_pdf(x, mean, std):
    pdf = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return pdf
In [3]:
# Calculate the probability density of a normal distribution with mean 0 and standard deviation 1 at x = 1
pdf = normal_pdf(1, 0, 1)
0.24197072451914337
Notation: B(n,p)
Parameters: n ∈ {0, 1, 2, 3, 4, ...} — number of trials.
p ∈ [0, 1] — success probability for each trial.
q = 1 - p
Examples:
Tossing a coin for 10 times,
Mean=np
Var= npq
Std= sqrt of Var.
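These formulas can be checked with scipy.stats.binom. For the coin-tossing example above, assuming a fair coin (n = 10, p = 0.5):

from scipy.stats import binom
mean, var = binom.stats(10, 0.5, moments='mv')
print(mean, var)   # 5.0 and 2.5, i.e. n*p and n*p*q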
In [5]:
import matplotlib.pyplot as plt
In [6]:
# 'sample' is assumed here to be an array of 0/1 Bernoulli draws, e.g. sample = np.random.binomial(n=1, p=0.4, size=1000)
plt.hist(sample, bins=2, range=[0, 1], edgecolor='black')
plt.xlabel('Success')
plt.ylabel('Frequency')
plt.title('Binomial Distribution')
plt.xticks([0, 1], ['Failure', 'Success'])
plt.show()
In [25]:
poisson_cdf(15,5)
Out[25]: 0.0027924293327009145
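The poisson_cdf helper is not shown in this export; a definition consistent with the call and output above (assuming its arguments are the mean followed by the count k) would be:

from scipy.stats import poisson
def poisson_cdf(mu, k):
    # P(X <= k) for a Poisson random variable with mean mu
    return poisson.cdf(k, mu)
poisson_cdf(15, 5)   # ~0.00279: probability of at most 5 events when 15 are expected on average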
Binomial Distribution:
A Binomial Distribution can also be described as a collection of independent Bernoulli trials.
Notation: B(n,p)
q= 1-p
Examples:
Poisson Distribution:
Discrete random variable.
Describes the no. of events occurring in a fixed time interval.
Examples:
Var = Mean = Expected no. of events to occur at every time interval * time interval
In [36]:
sample_mean = np.mean(sample)
sample_variance = np.var(sample)
In Poisson Distribution,
In Binomial Distribution,
Mean=np
Var= npq
In a normal distribution, the least frequent data appears in the tails of the distribution,
farther away from the mean.
The normal distribution is symmetric, with the mean located at the center. The probability
density function (PDF) of the normal distribution decreases gradually as you move away
from the mean in both directions.
As you move towards the tails of the distribution, the probability of observing data points
decreases. The data points located in the tails, which are farther away from the mean, are
less frequent compared to the data points closer to the mean.
Therefore, the least frequent data appears in the extreme ends of the distribution, in the
tails, while the most frequent data is concentrated around the mean.
The End
Probability Mass Function (PMF) and Probability Density Function (PDF) are types of
Probability Distribution Function.
Eg: Height of students in a class, Weight of students in a class. Probabilities (areas under the
density curve) always lie between 0 and 1.
The CDF is a monotonically non-decreasing function: it never decreases as you move from
left to right. The CDF is also right-continuous, and its values range from 0 to 1.
Intelligence Quotient
Ages
Weights
Heights and much more...
The shape of the normal distribution is determined by two parameters: the mean (μ) and
the standard deviation (σ).
Mean (μ): The mean determines the center or peak of the distribution. It represents
the average value around which the data cluster. Shifting the mean to the right or left
will shift the entire distribution accordingly.
Standard Deviation (σ): The standard deviation determines the spread or dispersion of
the distribution. A larger standard deviation results in a wider and flatter distribution,
indicating more variability in the data. Conversely, a smaller standard deviation leads
to a narrower and taller distribution, indicating less variability.
Together, the mean and standard deviation uniquely define the shape of the normal
distribution. Altering these parameters will shift the distribution horizontally or vertically
while preserving its characteristic bell shape.
Central Limit Theorem: The normal distribution plays a fundamental role in the Central
Limit Theorem (CLT).
Data Modeling: The normal distribution provides a useful framework for modeling and
analyzing continuous data in many real-world scenarios.
Statistical Inference: Many statistical techniques and hypothesis tests rely on the
assumption of normality.
Quality Control: In manufacturing and quality control processes, the normal distribution is
often used to monitor and assess product quality.
Bernoulli Distribution
In probability theory and statistics, the Bernoulli distribution, named after Swiss
mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable
which takes the value 1 with probability p and the value 0 with probability q= 1-p. Less
formally, it can be thought of as a model for the set of possible outcomes of any single
experiment that asks a yes–no question.
Binomial Distribution
A Binomial Distribution can also be described as a collection of independent Bernoulli trials.
Examples:
In [4]:
(60-50)/10
Out[4]: 1.0
for > 60
In [1]:
1-0.5398
Out[1]: 0.46020000000000005
Notation - U(a,b)
Example:
The number of candies sold daily at a shop is uniformly distributed with a minimum of 10 and
a maximum of 40.
Example :
Rolling a Dice. Outcomes can be {1,2,3,4,5,6}
To calculate a z-score, you first need to find the mean and standard deviation of the data
set. The mean is the average of the data points, and the standard deviation is a measure of
how spread out the data points are. Once you have the mean and standard deviation, you
can calculate the z-score for each data point using the following formula:
z = (x - μ) / σ
where:
z is the z-score
x is the data point
μ is the mean
σ is the standard deviation
Z-scores are important because they allow you to compare data points that have been
measured on different scales. For example, if you have a data set of heights and a data set
of test scores, you can use z-scores to compare the heights of the students to the test
scores of the students. This is because z-scores are a measure of how far away a data point
is from the mean, regardless of the scale that the data was measured on.
Z-scores are also used in statistical tests to compare the means of two or more groups. For
example, you could use a z-test to compare the heights of students in different grades. The
z-test would calculate the z-scores for the heights of the students in each grade, and then
it would compare the z-scores to see if there is a significant difference between the heights
of the students in the different grades.
In other words, the CLT describes the behavior of sample means or sums from any
distribution as the sample size becomes large, and the resulting distribution tends to be
normal, regardless of the underlying distribution. This theorem is essential because it
allows us to make inferences about the population from a relatively small sample.
Inferential Statistics: The CLT provides the foundation for inferential statistics, which is
the process of making inferences about a population based on a sample. Inferential
statistics are widely used in many fields, including business, economics, biology, and
social sciences.
Real-world Applications: The CLT has significant applications in the real world. It is
widely used in quality control, finance, engineering, and medical research, where it
helps in assessing the accuracy of measurements, estimating population parameters,
and determining sample sizes.
Independent and Identically Distributed (IID) Variables: The random variables being
summed or averaged must be independent of each other. Additionally, they should be
drawn from the same underlying distribution, meaning they have identical probability
distributions.
Finite Variance: The variables must have a finite variance (a measure of the dispersion
of the data). If the variance is infinite or does not exist, the CLT may not hold.
Sample Size: The CLT is applicable as the sample size increases. As the sample size
grows larger, the approximation to a normal distribution becomes more accurate.
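A small simulation illustrates the theorem (a sketch; the exponential population, sample size and number of repetitions are assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)            # clearly non-normal population
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]
plt.hist(sample_means, bins=40, edgecolor='black')               # approximately bell-shaped
plt.title('Distribution of sample means (n = 50)')
plt.show()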
The End
Point Estimate: Single numerical value used to estimate population parameter. Example:
Sample mean is a point estimate of population mean.
Interval Estimate: Range of values used to estimate the unknown population parameter.
Interval estimates of a population parameter are called confidence intervals.
sample_mean = 50
standard_deviation = 10
sample_size = 25
# alpha = 5% = 0.05
# Confidence_Interval = 0.95
In [2]:
def estimate(sample_mean, standard_deviation, sample_size):
    # assuming a 95% confidence interval, the z-score is 1.96
    lower_confidence_interval = sample_mean - (1.96 * (standard_deviation / sample_size ** 0.5))
    upper_confidence_interval = sample_mean + (1.96 * (standard_deviation / sample_size ** 0.5))
    return "I am 95% confident that the mean lies between [{:.2f}, {:.2f}]".format(lower_confidence_interval, upper_confidence_interval)
In [3]:
estimate(50,10,25)
Out[3]: 'I am 95% confident that the mean lies between [46.08, 53.92]'
Definition:
Hypothesis testing is a statistical method used to make inferences and draw conclusions
about a population based on a sample of data. It involves formulating two competing
hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and conducting
statistical tests to determine the likelihood of the observed data supporting one
hypothesis over the other.
Uses:
Hypothesis testing is used to assess the validity of assumptions or claims about a
population parameter or the relationship between variables. It helps researchers and
analysts make evidence-based decisions by providing a systematic framework for
evaluating the significance and reliability of their findings. By testing hypotheses, we can
determine if there is sufficient evidence to support a particular claim or if the observed
results are simply due to chance.
Importance:
The importance of hypothesis testing lies in its ability to provide a scientific and objective
approach to decision-making. It helps in validating or refuting research hypotheses,
determining the effectiveness of treatments or interventions, and assessing the significance
of relationships or differences between variables. Hypothesis testing allows us to draw
meaningful conclusions from data, make informed decisions, and contribute to the
advancement of knowledge in various fields, including science, medicine, business, and
social sciences.
Step 1 :
Null Hypothesis [H0]:
The average weight of male college students is greater than the average weight of female
college students.
Step 2 :
Alternate Hypothesis [H1]:
The average weight of male college students is not greater than the average weight of
female college students.
Step 3 :
95% Confidence Interval. C.I. = 0.95, alpha = 0.05, 1 - 0.05/2 = 0.9750; the z-score for 0.9750 is 1.96.
Step 4 :
Statistical Analysis:
Statistical Analysis can be done using the following formula:
Z = (x̄ - μ) / (σ / √n)
1] Null Hypothesis:
The null hypothesis (H0) is a statement of no effect, no difference, or no relationship
between variables. It assumes that any observed differences or relationships are due to
chance or random variation.
2] Alternative Hypothesis:
The alternative hypothesis (Ha or H1) is a statement that contradicts or negates the null
hypothesis. It suggests that there is an effect, a difference, or a relationship between
variables that is not due to chance.
Null Hypothesis: The mean test scores of students who receive tutoring are the same as
those who do not receive tutoring.
Alternative Hypothesis: The mean test scores of students who receive tutoring are different
from those who do not receive tutoring.
Example 2:
Example 3:
Null Hypothesis: The marketing campaign did not lead to an increase in product sales.
First Step
Null Hypothesis (H0):
This states that there is no significant difference between the population parameter and
the hypothesized value.
Second Step
Alternative Hypothesis (H1):
This states that there is a significant difference between the population parameter and the
hypothesized value.
Third Step
Set the Significance Level (α):
The significance level, denoted by α, determines the probability of rejecting the null
hypothesis when it is actually true. Commonly used significance levels are 0.05 (5%) or 0.01
(1%).
Fourth Step
Conduct the Test:
Calculate the test statistic: For a z-test, it is the observed sample statistic minus the
hypothesized population parameter, divided by the standard error.
Determine the critical value: The critical value(s) is obtained from the z-table or a statistical
software based on the chosen significance level. Compare the test statistic with the critical
value(s): If the test statistic falls in the critical region (beyond the critical value(s)), the null
hypothesis is rejected. Otherwise, it is not rejected.
Fifth Step
Make a Conclusion:
Based on the comparison in the previous step, either reject the null hypothesis or fail to reject it. If the
null hypothesis is rejected, it suggests that there is sufficient evidence to support the
alternative hypothesis. If the null hypothesis is not rejected, it means there is insufficient
evidence to support the alternative hypothesis.
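Putting the five steps together with made-up numbers (hypothesized mean 100, sample mean 103, sigma = 15, n = 36, alpha = 0.05), a sketch of the whole procedure:

from math import sqrt
from scipy.stats import norm
mu0, x_bar, sigma, n, alpha = 100, 103, 15, 36, 0.05
z_stat = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
z_critical = norm.ppf(1 - alpha / 2)         # two-tailed critical value (about 1.96)
if abs(z_stat) > z_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")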
Definition:
The p-value is a statistical measure that quantifies the strength of evidence against the null
hypothesis in hypothesis testing. It represents the probability of obtaining a test statistic as
extreme as, or more extreme than, the observed value, assuming that the null hypothesis is
true.
If the p-value is less than the predetermined significance level (α), typically 0.05 or 0.01, it
suggests that the observed data is statistically significant. This means that the observed
result is unlikely to have occurred by chance alone, leading to the rejection of the null
hypothesis in favor of the alternative hypothesis.
If the p-value is greater than the significance level (α), it indicates that the observed data
does not provide strong evidence against the null hypothesis. In this case, the null
hypothesis is not rejected, and it is concluded that the observed result could plausibly
occur due to chance or random variability.
The smaller the p-value, the stronger the evidence against the null hypothesis. A very small
p-value indicates a low probability of observing the data if the null hypothesis were true, and
therefore provides strong grounds for rejecting it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
"""The degrees of freedom (df) parameter determines the shape of the Student's t-
In this code, we set it to 10."""
df = 10
"""We use np.linspace to create an array of 100 equally spaced x values that span
The t.ppf function is used to calculate the percent point function (inverse cumul
for the given degrees of freedom."""
x = np.linspace(t.ppf(0.001, df), t.ppf(0.999, df), 100)
"""We use t.pdf to compute the probability density function (PDF) for the Student
given the x values and degrees of freedom."""
pdf = t.pdf(x, df)
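To actually draw the curve, a plotting step can be appended (a sketch using the x and pdf computed above):

plt.plot(x, pdf)
plt.xlabel('t value')
plt.ylabel('Probability Density')
plt.title("Student's t-distribution (df = 10)")
plt.show()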
t_stat = (1.3 - 1.6) / np.sqrt((0.25 / 22) + (0.09 / 22))
dof = 22 + 22 - 2
p_value = 2 * t.sf(abs(t_stat), dof)   # two-tailed p-value, using the t distribution imported above
print(t_stat, p_value)
-2.4131989996195315 0.020254894248790345
The t-distribution is used in hypothesis testing and constructing confidence intervals when
the sample size is small (typically less than 30) and the population standard deviation is
unknown, or when the population is approximately normally distributed or the sample size
is large enough to apply the Central Limit Theorem. It provides critical values and
probabilities for t-tests, which compare sample means and assess the significance of the
difference between them.
In this formula, the sample statistic represents the observed value from the sample data
(e.g., sample mean), the hypothesized parameter is the value assumed under the null
hypothesis (e.g., population mean), and the standard error measures the variability or
uncertainty in the sample statistic. The t-statistic indicates how many standard errors the
sample statistic deviates from the hypothesized parameter. By comparing the calculated t-
value to critical values from the t-distribution, we can determine the statistical significance
of the observed difference or relationship.
In [8]:
# Let's consider the following data.
n = 50
sample_mean = 500
std = 50
confidence_interval = 95
sigma = 0.05
lower_ci = sample_mean - 1.96 * (std / n ** 0.5)
upper_ci = sample_mean + 1.96 * (std / n ** 0.5)
print("I am 95% confident that the population mean lies between", lower_ci, " and ", upper_ci)
I am 95% confident that the population mean lies between 486.1407070887437 and 513.8592929112564
Alternate Hypothesis: New drug will not decrease blood pressure by 10 mmHg.
1-0.05/2 =0.9750
Std = 3
n=100
Sample_mean= 8
In [20]:
from math import sqrt
In [10]:
Z_test= (8-10)/ (3/(sqrt(100) ))
In [11]:
Z_test
Out[11]: -6.666666666666667
In [12]:
if Z_test < -1.96:
print ("We reject the null hypothesis")
In [13]:
t= (4.8-5) / (0.5 /sqrt(25))
In [14]:
t
Out[14]: -2.0000000000000018
In [15]:
if t < -2.797:
    print("We Reject the null hypothesis")
elif t > 2.797:
    print("We Reject the null hypothesis")
else:
print("We fail to reject the null hypothesis")
Alternate Hypothesis: The population means for the two groups are not equal.
In [16]:
t=(80 - 75)/( sqrt(((10**2)/ 30)+((8**2)/40)))
In [17]:
t
Out[17]: 2.2511258444537408
In [19]:
if t > 2.663 :
print("We reject the Null Hypothesis")
elif t < -2.663:
print("We reject the Null Hypothesis")
else :
print ("We fail to reject the Null Hypothesis")
1 - 0.01/2 = 0.995
In [23]:
Lower_Confidence_Interval = 4 - (2.680 * 0.2121)
Upper_Confidence_Interval = 4 + 0.5685
In [33]:
print("I am 99% confident that the population mean lies between ", Lower_Confiden
I am 99% confident that the population mean lies between 3.431572 and 4.5685
THE END
if No:
then you should use T-test.
if Yes:
Is the sample size greater than 30?
In a one-tailed test, the critical region is located in one tail of the distribution, either the
left or the right. This means that the test is only looking for evidence of a difference in one
direction. For example, a one-tailed test could be used to determine whether a new drug is
more effective than a placebo in reducing pain. The critical region would be located in the
right tail of the distribution, because the researchers are only interested in finding evidence
that the drug is more effective than the placebo. For example: Does taking a new
medication improve the patient's condition?
In a two-tailed test, the critical region is located in both tails of the distribution. This means
that the test is looking for evidence of a difference in either direction. For example, a two-
tailed test could be used to determine whether there is a difference in the average height
of men and women. The critical region would be located in both the left and right tails of
the distribution, because the researchers are interested in finding evidence that men are
taller than women, or that women are taller than men.
For example:
Does taking a new medication improve the patient's condition, regardless of whether the
patient's condition is getting better or worse?
Outcomes:
Outcome 1: We reject the Null hypothesis when Null hypothesis is False in Reality.
{Good}
Outcome 2: We reject the Null hypothesis when Null hypothesis is True in Reality.
{Type 1 Error}
Example: A researcher conducts a study to test the effectiveness of a new drug. The
researcher rejects the null hypothesis and concludes that the drug is effective. However, the
drug is actually not effective and the researcher has made a Type I error.
Outcome 3: We accept the Null hypothesis when Null hypothesis is False in Reality.
{Type 2 Error}
Example:A researcher conducts a study to test the effectiveness of a new drug. The researcher
does not reject the null hypothesis and concludes that the drug is not effective. However, the
drug is actually effective and the researcher has made a Type II error.
Outcome 4 : We accept the Null hypothesis when Null hypothesis is True in Reality.
{Good}
Example: A researcher conducts a study to compare the average height of men and women.
The null hypothesis was true in reality: there was actually no difference in the average height
of men and women. The researcher correctly accepted the null hypothesis and concluded that
there is no difference in the average height of men and women.
P(A|B) is the probability of event A occurring, given that event B has occurred
P(B|A) is the probability of event B occurring, given that event A has occurred
P(A) is the probability of event A occurring
P(B) is the probability of event B occurring
For example, let's say that you have a bag of marbles that contains 10 red marbles and 10
blue marbles. You reach into the bag and pull out a marble without looking. What is the
probability that the marble you pulled out is red?
Without any other information, we can say that the probability of pulling out a red marble
is 50%. This is because there are an equal number of red and blue marbles in the bag.
Now, let's say that you look at the marble and see that it is red. What is the probability that
the marble you pulled out is red now?
Using Bayes' theorem, we can update our probability to 75%. This is because we now know
that the marble is red, and we also know that there are more red marbles in the bag than
blue marbles.
Formula : CI = x̄ ± zα/2 * σ / √n
where:
For example, a 95% confidence interval means that there is a 95% chance that the
confidence interval will contain the true population mean.
For example, let's say that we want to calculate a 95% confidence interval for the average
height of men. We know that the sample mean is 68 inches, the sample standard deviation
is 2 inches, and the sample size is 100.
In [4]:
from math import sqrt
# CI = 68 ± 1.96 * 2 / √100
Lower_CI=68 - 1.96 * 2 / sqrt(100)
Upper_CI=68 + 1.96 * 2 / sqrt(100)
Lower_CI,Upper_CI
This means that we can be 95% confident that the true average height of men is between
67.61 inches and 68.39 inches.
Problem:
A certain disease affects 1 in every 1000 people. A test has been developed to detect the
disease, and it is known to have a 95% accuracy rate (i.e. if a person has the disease, the
test will correctly identify it 95% of the time, and if a person does not have the disease, the
test will correctly identify it as negative 95% of the time). If a randomly selected person
tests positive for the disease, what is the probability that they actually have the disease?
A: The person has the disease. B: The person tests positive for the disease.
P(A) = 1/1000
P(B|A) = 0.95
P(B|not A) = 0.05
We want to find: P(A|B) (the probability of the person having the disease given that they
tested positive)
In [13]:
# P(B) = P(B|A)P(A) + P(B|not A)P(not A) = 0.95*0.001 + 0.05*0.999 = 0.0509
probability_A_given_B = ((0.95) * (1/1000)) / (0.0509)
Therefore, if a randomly selected person tests positive for the disease, the probability that
they actually have the disease is approximately 1.87%.
CI = x̄ ± zα/2 * σ / √n
where:
In [24]:
# Assuming n = 1000
from math import sqrt
Lower_Ci = 50 - 1.96 * 5 / sqrt(1000)
Upper_Ci = 50 + 1.96 * 5 / sqrt(1000)
print("I am 95% confident that the mean lies between {:.2f} and {:.2f}.".format(Lower_Ci, Upper_Ci))
I am 95% confident that the mean lies between 49.69 and 50.31.
A larger sample size tends to result in a smaller margin of error. This is because as the
sample size increases, the standard error decreases. The standard error is inversely
proportional to the square root of the sample size. Therefore, a larger sample size leads to
a more precise estimate and a narrower confidence interval.
Margin_of_error= zα/2 * σ / √n
CI = x̄ ± Margin_of_error
Lets consider 2 scenarios:
n=100
n=1000
In [31]:
from math import sqrt
Margin_of_Error_1 = 1.96 * 5 / sqrt(100)    # n = 100
print("Result for Scenario 1 : " + str(Margin_of_Error_1))
Margin_of_Error_2 = 1.96 * 5 / sqrt(1000)   # n = 1000
print("Result for Scenario 2 : " + str(Margin_of_Error_2))
Q9. Calculate the z-score for a data point with a value of 75,
a population mean of 70, and a population standard
deviation of 5. Interpret the results.
z = (x - μ) / σ
Where:
In [1]:
z=(75-70)/5
In [2]:
z
Out[2]: 1.0
This means that the Datapoint 75 is 1 standard deviation away from the mean.
Decision Rule : if the t-test is greater than 2.045 or lesser than -2.045, reject the null
hypothesis.
t = (x̄ - μ) / (s / √n)
In [2]:
from math import sqrt
t = (6 - 0) / (2.5 / sqrt(50))
In [4]:
if t > 2.045 :
print("We reject the Null Hypothesis")
elif t < -2.045:
print("We reject the Null Hypothesis")
else:
print("We fail to reject the Null Hypothesis ")
CI = x̄ ± zα/2 * σ / √n
where:
In [6]:
Standard_Error = sqrt((0.65 * (1 - 0.65)) / 500)
In [7]:
lower_ci = 0.65 - 1.96 * Standard_Error
In [8]:
upper_ci = 0.65 + 1.96 * Standard_Error
In [13]:
print("I am 95% confident that the population proportion lies between {:.4f} and {:.4f}".format(lower_ci, upper_ci))
I am 95% confident that the population proportion lies between 0.6082 and 0.6918
assuming n to be 100
Degree of freedom : 100 + 100 -2 =198
The critical t-value for α/2 = 0.005 and df = 198 is approximately ±2.617.
Decision Rule : If the t-test is lesser than -2.617 or greater than +2.617, We reject the
Null hypothesis.
calculating t-statistics
In [1]:
from math import sqrt
t=(85-82)/ sqrt (((6**2)/100)+((5**2)/100) )
In [4]:
if t < -2.617:
print("We reject the Null hypothesis.")
elif t> +2.617:
print("We reject the Null hypothesis.")
else:
print("We fail to reject the Null hypothesis.")
Conclusion:
Based on the given data and conducting the two-sample t-test, we have evidence to
suggest that there is a significant difference in student performance between the two
teaching methods at a significance level of 0.01.
In [5]:
Lower_ci= 65 -( 1.645* (8 /sqrt(50)))
In [6]:
Upper_ci= 65 +( 1.645* (8 /sqrt(50)))
In [9]:
print("I am 95% confident that the population mean lies between "+ str(Lower_ci)+
I am 95% confident that the population mean lies between 63.13889495191701 and 66.8
6110504808299
Sample size n = 30
Degrees of freedom: 30 - 1 = 29
Decision Rule : If the t-test is lesser than -2.045 or greater than + 2.045 , We reject the
Null hypothesis.
In [23]:
t = (0.25 - 0) / (0.05 / sqrt(30))
In [22]:
if t < -2.045:
print("We reject the Null hypothesis.")
elif t> +2.045:
print("We reject the Null hypothesis.")
else:
print("We fail to reject the Null hypothesis.")
THE END
Assuming n to be 100
In [1]:
from math import sqrt
Lower_ci= 50 - ( 1.96* (5/sqrt(100)))
Upper_ci= 50 + ( 1.96* (5/sqrt(100)))
In [2]:
print("I am 95% confident that the population mean lies between "+ str(Lower_ci)+
I am 95% confident that the population mean lies between 49.02 and 50.98
In [3]:
import pandas as pd
df=pd.DataFrame({"Colours":["Blue","Orange","Green","Yellow","Red","Brown"],"Expe
In [4]:
sum(df['Observed_Data'])
Out[4]: 300
In [5]:
df['Expected_Data'] = (df['Expected_Data(in %)']/100 )*300
In [6]:
df
  Colours  Expected_Data(in %)  Observed_Data  Expected_Data
0    Blue                   20             45           60.0
1  Orange                   20             55           60.0
2   Green                   20             50           60.0
3  Yellow                   10             30           30.0
4     Red                   10             25           30.0
5   Brown                   20             95           60.0
In [7]:
import scipy.stats as stat
chisquare_test_statistics,p_value=stat.chisquare(df['Observed_Data'],df['Expected_Data'])
In [8]:
chisquare_test_statistics,p_value
In [9]:
# find the critical value
significance_value=0.05
dof=len(df['Expected_Data']) -1
critical_value = stat.chi2.ppf(1 - significance_value, dof)
In [10]:
if chisquare_test_statistics > critical_value:
print ("we reject the null hypothesis")
else :
print ("we fail to reject the null hypothesis")
In [12]:
import pandas as pd
# 'data' is not shown in this export; values taken from the table below, the column labels "Group A"/"Group B" are assumed
data = {'Outcome': ['Outcome1', 'Outcome2', 'Outcome3'], 'Group A': [20, 10, 15], 'Group B': [15, 25, 20]}
df1 = pd.DataFrame(data)
df1 = df1.set_index('Outcome')
In [13]:
df1
Outcome
Outcome1 20 15
Outcome2 10 25
Outcome3 15 20
In [14]:
from scipy.stats import chi2_contingency
sample_size = 500
sample_proportion = 60 / sample_size
confidence_level = 0.95
Margin_of_error= zα/2 * σ / √n
CI = x̄ ± Margin_of_error
In [17]:
# Calculate the margin of error; the standard error of a proportion is sqrt(p*(1-p)/n)
standard_error = (sample_proportion * (1 - sample_proportion) / sample_size) ** 0.5
margin_of_error = stat.norm.ppf(1 - (1 - confidence_level) / 2) * standard_error
confidence_level = 0.9
In [19]:
from math import sqrt
mean=75
standard_deviation=12
confidence_level = 0.90
sample_size=100
In [20]:
margin_of_error = stat.norm.ppf(1 - (1 - confidence_level) / 2) * (standard_deviation / sqrt(sample_size))
lower_bound = mean - margin_of_error
upper_bound = mean + margin_of_error
In [21]:
print("Confidence Interval: [{:.4f}, {:.4f}]".format(lower_bound, upper_bound))
In [23]:
df = 10
defines the degrees of freedom (df) for the chi-square distribution. In this case, we set it to
10.
In [24]:
x = np.linspace(0, 30, 500)
generates an array of 500 evenly spaced values ranging from 0 to 30. These values will be
used as the x-axis values for plotting the chi-square distribution.
In [25]:
pdf = stats.chi2.pdf(x, df)
This line calculates the probability density function (PDF) of the chi-square distribution
using the pdf() function from scipy.stats.chi2. It takes the array of x values (x) and the
degrees of freedom (df) as input and returns the corresponding PDF values.
In [26]:
plt.plot(x, pdf)
x_fill = np.linspace(15, 30, 500)
pdf_fill = stats.chi2.pdf(x_fill, df)
plt.fill_between(x_fill, pdf_fill, color='green', alpha=0.3)
plt.xlabel('Chi-square Statistic')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
#Confidence_Interval is 99%
Sigma = 0.01
In [28]:
from math import sqrt
standard_error = sqrt((sample_proportion * (1 - sample_proportion)) / sample_size
In [29]:
Z_score = stats.norm.ppf(1 - (1 - 0.99) / 2)
In [30]:
margin_of_error = Z_score * standard_error
In [31]:
#Confidence_interval
lower_bound= mean - margin_of_error
upper_bound= mean + margin_of_error
lower_bound,upper_bound
Conclusion:
This means that we are 99% confident that the true proportion of people in the population
who prefer Coke falls between approximately 519.9593 and 520.0407.
In [33]:
chisquare_test_statistics,p_value
In [34]:
# find the critical value
significance_value=0.05
dof=len(expected) -1
critical_value = stat.chi2.ppf(1 - significance_value, dof)
In [35]:
if chisquare_test_statistics > critical_value:
print ("we reject the null hypothesis")
else :
print ("we fail to reject the null hypothesis")
data = {
"Status": ["Smoker", "Non-smoker"],
"Lung Cancer Yes": [60, 30],
"Lung Cancer No": [140, 170]
}
df = pd.DataFrame(data)
df = df.set_index('Status')
df
            Lung Cancer Yes  Lung Cancer No
Status
Smoker                   60             140
Non-smoker               30             170
In [37]:
from scipy.stats import chi2_contingency
In [39]:
alpha = 0.05
Reject the null hypothesis. There is a significant association between smoking status and lung cancer diagnosis.
In [43]:
observed=[[200 ,150, 150],
[225, 175, 100]]
In [46]:
alpha = 0.01
# Sample statistics
sample_mean = 72
sample_std = 10
sample_size = 30
# Significance level
alpha = 0.05
Fail to reject the null hypothesis. There is not enough evidence to conclude that the population mean is significantly different from 70.
t-statistic: 1.0954451150103321
p-value: 0.28233623728606977
THE END
Assumptions:
1. Normality of Sampling Distribution: The sample mean is normally distributed. Example
of Violation: if the data is highly skewed to one side, it may violate the normality
assumption.
2. Absence of Outliers: Any outliers present in the data should be removed. Example of
   Violation: if the data contains outliers, it may violate the normality assumption.
3. Homogeneity of Variance: Each population has the same variance, (σ1)² = (σ2)² = (σ3)²; the
   population variances at the different levels of each independent variable are equal (a quick
   check for this is sketched after this list). Example of Violation: if one group has much larger
   variability compared to the others, the assumption is violated.
4. Sample are Independent and Random: Example of Violation: if the same individuals are
measured multiple times in different groups, the independence assumption is violated.
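One way to check the homogeneity-of-variance assumption (point 3 above) is Levene's test; a minimal sketch with made-up group data:

from scipy import stats
group1 = [2.1, 2.5, 2.3, 2.8]   # hypothetical samples
group2 = [2.0, 2.7, 2.6, 2.4]
group3 = [2.2, 2.9, 2.5, 2.6]
stat_value, p_value = stats.levene(group1, group2, group3)
print(stat_value, p_value)       # a small p-value (e.g. < 0.05) suggests the assumption is violated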
1] One Way ANOVA : One factor with at least 2 levels; these levels are independent.
eg: A doctor wants to test a new medication to decrease headaches. They split the participants into
3 conditions [10mg, 20mg, 30mg] and ask the patients to rate their headache on a scale of [1-10].
In this example, Medication is the factor and the 3 conditions are its 3 levels.
2] Repeated Measures ANOVA : One factor with at least 2 levels, but the levels are dependent.
eg: Consider a factor Running and levels as Day1, Day2 and Day3.
Within-group variance: This measures the variation within each group. It represents
the random or unexplained differences within the groups that are not related to the
factor we are studying.
Total variance: This is the overall variability in the data, which includes both the
differences between groups and the differences within groups.
Assess the strength of the effect: We can see how strong the relationship is between
the factor we are studying and the outcome.
Test hypotheses and draw conclusions: We can determine if the differences between
groups are statistically significant.
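In practice, a one-way ANOVA on three groups can be run in a single call; a sketch with made-up weight-loss values for three diets:

from scipy.stats import f_oneway
diet_a = [2.1, 3.4, 1.8, 2.9, 3.0]   # hypothetical data
diet_b = [1.5, 2.2, 1.9, 2.4, 2.0]
diet_c = [3.1, 2.8, 3.5, 2.9, 3.3]
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)
print(f_stat, p_value)               # a small p-value means at least one group mean differs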
Q4. How would you calculate the total sum of squares (SST),
explained sum of squares (SSE), and residual sum of squares
(SSR) in a one-way ANOVA using Python?
In [1]:
Group1=[23,25,18]
Group2=[29,19,21]
Group3=[35,17]
In [2]:
import numpy as np
In [3]:
mean1=np.mean(Group1)
mean2=np.mean(Group2)
mean3=np.mean(Group3)
In [4]:
Grand_mean = (np.sum(Group1) + np.sum(Group2) + np.sum(Group3)) / (len(Group1) + len(Group2) + len(Group3))
In [5]:
Grand_mean
Out[5]: 23.375
For each subject, compute the difference between its score and its group mean. You
thus have to compute each of the group means, and compute the difference between
each of the scores and the group mean to which that score belongs
Square all these differences
Sum the squared differences
In [6]:
sum1=[]
sum2=[]
sum3=[]
for i in Group1:
sum1.append((i - mean1)**2)
for i in Group2:
sum2.append((i - mean2)**2)
for i in Group3:
sum3.append((i - mean3)**2)
In [7]:
SSW=np.sum(sum1) + np.sum (sum2) + np.sum(sum3)
SSW
Out[7]: 244.0
For each subject, compute the difference between its group mean and the grand
mean. The grand mean is the mean of all N scores (just sum all scores and divide by
the total sample size )
Square all these differences
Sum the squared differences
In [8]:
ssb1=[]
ssb2=[]
ssb3=[]
for i in Group1:
ssb1.append((mean1- Grand_mean)**2)
for i in Group2:
ssb2.append((mean2- Grand_mean)**2)
for i in Group3:
ssb3.append((mean3- Grand_mean)**2)
In [9]:
SSB=np.sum(ssb1) + np.sum (ssb2) + np.sum(ssb3)
In [10]:
SSB
Out[10]: 19.875
For each subject, compute the difference between its score and the grand mean
Square all these differences
Sum the squared differences
In [11]:
sst1=[]
sst2=[]
sst3=[]
for i in Group1:
sst1.append((i- Grand_mean)**2)
for i in Group2:
sst2.append((i- Grand_mean)**2)
for i in Group3:
sst3.append((i- Grand_mean)**2)
In [12]:
SST= np.sum(sst1) + np.sum (sst2) + np.sum(sst3)
In [13]:
SST
Out[13]: 263.875
If you have computed two of the three sums of squares, you can easily compute the third
one by using the fact that SST = SSW + SSB.
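For the three groups above, the identity can be checked directly:

SSW + SSB   # 244.0 + 19.875 = 263.875, which equals SST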
In [15]:
# Create sample data for two factors (A and B) and the response variable (Y)
A = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
B = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
In [16]:
# Combine the data into a DataFrame
data = pd.DataFrame({'A': A, 'B': B, 'Y': Y})
In [17]:
data
Out[17]:
   A  B   Y
0  1  1  10
1  2  1  12
2  3  1  14
3  4  1  16
4  5  1  18
5  1  2  20
6  2  2  22
7  3  2  24
8  4  2  26
9  5  2  28
In [18]:
# Calculate the main effects and interaction effect
main_effect_A, pvalue_A = stats.f_oneway(data[data['B'] == 1]['Y'],
data[data['B'] == 2]['Y'])
F-statistic:
The F-statistic (5.23 in this case) measures the extent of differences between the groups in
the one-way ANOVA. A higher F-statistic suggests that the differences between the groups
are more noticeable and significant.
P-value:
The p-value (0.02 in this case) provides a measure of the strength of evidence against the
null hypothesis. A p-value of 0.02 means that there is a 2% chance of obtaining such a
large F-statistic if there were actually no real differences between the groups.
Interpretation
The obtained F-statistic of 5.23 indicates that there are noticeable differences between the
groups. These differences are not likely to occur by random chance alone.
Statistical significance:
The low p-value of 0.02 suggests strong evidence against the null hypothesis. It indicates
that the observed differences between the groups are unlikely to be due to random
variation alone.
df between = a-1 =2
df within = N-a = 150 -3 = 147
df total = 150-1 = 149
In [20]:
import scipy.stats as stat
#Degrees of freedom
df_between = 2
df_within = 147
df_total = 149
In [21]:
mean1=np.mean(A)
mean2=np.mean(B)
mean3=np.mean(C)
In [22]:
Grand_mean = (np.sum(A) + np.sum (B) + np.sum (C)) / ((len(A)+len(B)+len(C)))
In [23]:
# Calculating Sum of Square Within
ssw1=[]
ssw2=[]
ssw3=[]
for i in A:
ssw1.append((i - mean1)**2)
for i in B:
ssw2.append((i - mean2)**2)
for i in C:
ssw3.append((i - mean3)**2)
In [24]:
SSW=np.sum(ssw1) + np.sum (ssw2) + np.sum(ssw3)
SSW
Out[24]: 201.88
In [25]:
# Calculating Sum of Square Between
ssb1=[]
ssb2=[]
ssb3=[]
for i in A:
ssb1.append((mean1- Grand_mean)**2)
for i in B:
ssb2.append((mean2- Grand_mean)**2)
for i in C:
ssb3.append((mean3- Grand_mean)**2)
SSB=np.sum(ssb1) + np.sum (ssb2) + np.sum(ssb3)
SSB
Out[25]: 0.2800000000000005
In [26]:
# Calculating Sum of Square Total
sst1=[]
sst2=[]
sst3=[]
for i in A:
sst1.append((i- Grand_mean)**2)
for i in B:
sst2.append((i- Grand_mean)**2)
for i in C:
sst3.append((i- Grand_mean)**2)
In [27]:
SST= np.sum(sst1) + np.sum (sst2) + np.sum(sst3)
In [28]:
SST
Out[28]: 202.16000000000003
In [29]:
## Another method for SST
SSW + SSB
Out[29]: 202.16
In [30]:
# Mean of Squares
Ms_between = SSB/df_between
Ms_within = SSW/df_within
Ms_total = SST/df_total
In [31]:
f = Ms_between / Ms_within
In [32]:
import scipy.stats as stats
p-value: 0.903145943262158
In [33]:
if f > 3.057620651649394:
print ("Reject the Null Hypothesis.")
else :
print ("We Fail to Reject the Null Hypothesis.")
The one-way ANOVA results indicate that we fail to reject the null hypothesis. This means
that there is not enough evidence to conclude that there are significant differences
between the mean weight loss of the three diets (A, B, and C).
   Time Program   Experience
0    12       A       Novice
1    15       B  Experienced
2    18       C       Novice
3    14       A  Experienced
4    16       B       Novice
In [35]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [36]:
model = ols("Time ~ Program + Experience + Program:Experience", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
To interpret the results, focus on the p-values. A p-value less than the significance level (e.g.,
0.05) suggests that there is a significant effect.
If the p-value for the "Program" factor is significant, it indicates that there are
significant differences in the average time to complete the task among the three
software programs (A, B, and C).
If the p-value for the "Experience" factor is significant, it suggests that there are
significant differences in the average time to complete the task between novice and
experienced employees.
If the p-value for the interaction effect is significant, it implies that the effect of the
software program on the time to complete the task differs depending on the
employee's experience level.
In [37]:
print(anova_table)
# Example interpretation
if anova_table['PR(>F)']['Program'] < 0.05:
    print("There is a significant difference in the average time to complete the task among the three programs.")
sum_sq df F PR(>F)
Program 1.866667 2.0 0.173375 0.841866
Experience 0.033333 1.0 0.006192 0.937932
Program:Experience 69.066667 2.0 6.414861 0.005863
Residual 129.200000 24.0 NaN NaN
There is a significant interaction effect between the software programs and employee experience level.
In [39]:
# Generate random test scores for the control and experimental groups
np.random.seed(42) # Set a seed for reproducibility
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)
In [40]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)
alpha = 0.05 # Significance level
In [41]:
print("Two-sample t-test results:")
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
In [42]:
if p_value < alpha:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")
There is a significant difference in test scores between the control and experimental groups.
In [43]:
df = pd.DataFrame({
'Test Scores': np.concatenate([control_scores, experimental_scores]),
    'Group': np.concatenate([np.repeat('Control', len(control_scores)),
                             np.repeat('Experimental', len(experimental_scores))])
})
In [44]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
store_a=np.random.randint(10000,25000,30)
store_b=np.random.randint(10000,20000,30)
store_c=np.random.randint(10000,30000,30)
In [46]:
# Create a dataframe with the sales data
df = pd.DataFrame({
'Day': list(range(1, 31)) * 3, # Day numbers
'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30, # Store labels
'Sales': [120, 135, 128, 110, 115, 130, 125, 132, 122, 130, 133, 140, 120, 12
150, 142, 138, 145, 148, 135, 138, 132, 130, 128, 120, 125, 170, 17
210, 198, 202, 190, 205, 192, 158, 160, 168, 155, 150, 165, 132, 13
125, 120, 122, 118, 115, 155, 152, 148, 150, 160, 158, 240, 235, 23
172, 180, 185, 188, 195, 168, 175, 172, 170, 162, 158, 198, 205, 21
})
In [47]:
print(df.head())
In [48]:
import statsmodels.api as sm
In [49]:
from statsmodels.formula.api import mixedlm
In [50]:
df['Store'] = df['Store'].astype('category')
df['Day'] = df['Day'].astype('category')
The End
df1=len(data1)-1
df2=len(data2)-1
p_value = f.sf(f_test, df1, df2)
In [2]:
test([10,20,34,56,78,64,98,56,53,75],[90,87,64,34,12,56,77,45])
In [4]:
test2(23,29)
Out[4]: 1.9102874554747564
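The helper functions test and test2 are not shown in this export. Sketches consistent with how they are called here (their exact bodies are an assumption) might look like:

import numpy as np
from scipy.stats import f

def test(data1, data2):
    # variance-ratio F-test: F statistic and its upper-tail p-value
    f_test = np.var(data1, ddof=1) / np.var(data2, ddof=1)
    df1 = len(data1) - 1
    df2 = len(data2) - 1
    p_value = f.sf(f_test, df1, df2)
    return f_test, p_value

def test2(df1, df2, alpha=0.05):
    # upper-tail critical F value for the given degrees of freedom
    return f.ppf(1 - alpha, df1, df2)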
df1 = len(sample1) - 1
df2 = len(sample2) - 1
print("f-value:", f_value, " df1:", df1, " df2:", df2, " p-value:", p_value)
In [6]:
import numpy as np
sample1 = np.random.normal(size=30)
sample2 = np.random.normal(size=50)
test3(sample1, sample2)
# conclusion
if f_value > critical_value:
print("There is a Significant difference.")
else :
print("There is no Significant difference.")
# calculate mean
if dfd > 2:
mean = dfd / (dfd - 2)
else:
mean = float('inf')
# calculate variance
if dfd > 4:
    variance = (2 * (dfd ** 2) * (dfn + dfd - 2)) / (dfn * (dfd - 2) ** 2 * (dfd - 4))
elif dfd <= 4 and dfd > 2:
    variance = float('inf')
else:
    variance = float('nan')
In [10]:
test4(29,49)
n2= 15
sample_variance2 = 20
alpha = 0.10
#Performing tests
# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")
dfn = len(Restaurant_A)-1
dfd = len(Restaurant_B)-1
In [13]:
import numpy as np
import scipy.stats as stat
from scipy.stats import f
# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")
alpha = 0.01
#calculations
f_value = np.var(Group_A) / np.var(Group_B)
critical_value = stat.f.ppf(q=1-alpha, dfn=df1, dfd=df2)
# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")
The End