Statistics

Download as pdf or txt
Download as pdf or txt
You are on page 1of 163

5/26/23, 9:25 PM 1.

1Measurement of Central Tendency

Measurement of Central Tendency


In [1]: import numpy as np
import seaborn as sns

In [2]: #mean(Average)
age=[12,32,45,67,80,75,35,16,18,200]

In [3]: np.mean(age)

Out[3]: 58.0

In [4]: weights=[56,57,75,46,56,78,80,85]

In [5]: np.mean(weights)

Out[5]: 66.625

In [6]: df=sns.load_dataset('tips')

In [7]: np.mean(df['total_bill'])

Out[7]: 19.78594262295082

In [8]: #median
np.median(age)

Out[8]: 40.0

In [9]: from scipy import stats

In [10]: stats.mode(age)

/tmp/ipykernel_213/2474845003.py:1: FutureWarning: Unlike other reduction funct


ions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preser
ves the axis it acts along. In SciPy 1.11.0, this behavior will change: the def
ault value of `keepdims` will become False, the `axis` over which the statistic
is taken will be eliminated, and the value None will no longer be accepted. Set
`keepdims` to True or False to avoid this warning.
stats.mode(age)
Out[10]: ModeResult(mode=array([12]), count=array([1]))

In [11]: np.median(df['total_bill'])

Out[11]: 17.795

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/1.1Measurement of Central Tendency.ipynb 1/1


5/26/23, 9:26 PM 2.1Measure of Dispersion

Measurement of Dispersion
In [1]: ages_lst=[24,12,34,55,67,86,54,64,21,9,75,50]

In [2]: import numpy as np

In [3]: import seaborn as sns

In [4]: np.mean(ages_lst)

Out[4]: 45.916666666666664

In [5]: np.var(ages_lst)

Out[5]: 595.4097222222222

In [6]: np.std(ages_lst)

Out[6]: 24.401018876723615

In [7]: sns.histplot(ages_lst,kde=True)

Out[7]: <AxesSubplot: ylabel='Count'>

In [8]: import pandas as pd

In [9]: Data=[[18,25,45],[67,87,43],[23,90,65]]
df=pd.DataFrame(Data,columns=['A','B','c'])

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/2.1Measure of Dispersion.ipynb 1/2


5/26/23, 9:26 PM 2.1Measure of Dispersion

In [10]: df.mean()

Out[10]: A 36.000000
B 67.333333
c 51.000000
dtype: float64

In [11]: df.median()

Out[11]: A 23.0
B 87.0
c 45.0
dtype: float64

In [12]: df.std()

Out[12]: A 26.962938
B 36.692415
c 12.165525
dtype: float64

In [13]: df.var()

Out[13]: A 727.000000
B 1346.333333
c 148.000000
dtype: float64

In [14]: df.var(axis=1)

Out[14]: 0 196.333333
1 485.333333
2 1146.333333
dtype: float64

In [15]: df.std(axis=1)

Out[15]: 0 14.011900
1 22.030282
2 33.857545
dtype: float64

In [ ]:

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/2.1Measure of Dispersion.ipynb 2/2


5/26/23, 9:27 PM 3.1 Chi Square

Chi Square Test With Python.


In [1]: import scipy.stats as stat
import numpy as np

In [2]: #No. of hours students study in a weekly basis daily


# monday, tuesday, wednesday, thursday, friday, saturday,sunday
expected_data= [8,6,7,8,6,9,7]
observed_data=[7,8,6,8,9,6,7]

In [3]: sum(expected_data),sum (observed_data)

Out[3]: (51, 51)

In [4]: ## Chi- Square Goodness of fit test


chisquare_test_statistics,p_value=stat.chisquare(observed_data,expected_data)

In [5]: chisquare_test_statistics,p_value

Out[5]: (3.4345238095238093, 0.7526596580922865)

In [6]: # find the critical value


significance_value=0.05
dof=len(expected_data) -1
crtitcal_value= stat.chi2.ppf(1-significance_value,dof)

In [7]: if chisquare_test_statistics > crtitcal_value:


print ("we reject the null hypothesis")
else :
print ("we fail to reject the null hypothesis")

we fail to reject the null hypothesis

In [8]: ## Expected data is correct!

THE END
In [ ]:

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/3.1 Chi Square.ipynb 1/1


5/26/23, 9:27 PM 4.1 F test with Python

F test with Python


In [1]: #Evidence to say that 2 population variances are not equal.
# Both the performance is different
worker1=[18,19,22,25,27,28,41,45,51,55]
worker2=[14,15,15,17,18,22,27,34]

In [2]: #Calculating F test


import numpy as np
f_test=np.var(worker1)/ np.var(worker2)

In [3]: f_test

Out[3]: 3.874302158273381

In [4]: # Degree of freedon


df1=len(worker1)-1
df2=len(worker2)-1
signifiacnce_value=0.05

In [5]: import scipy.stats as stat

In [6]: # Critical Value


#dfn= Degree of freedom of numerator
#dfd=Degree of freedm of denominator
critical_value=stat.f.ppf(q=1-signifiacnce_value,dfn=df1,dfd=df2)

In [7]: critical_value

Out[7]: 3.6766746989395105

In [8]: # If ftest is greater than critical value we reject the null hypothesis else we
if f_test > critical_value:
print("Reject the Null hypothesis")
else :
("Fail to reject the Null hypothesis")

Reject the Null hypothesis

In [9]: # Hence, Worker 1 is performing better than worker 2

The End

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/4.1 F test with Python.ipynb 1/1


5/26/23, 9:29 PM Covariance and correlation

Covariance and Correlation


In [1]: import seaborn as sns

In [2]: df=sns.load_dataset('healthexp')

In [3]: df.head()

Out[3]: Year Country Spending_USD Life_Expectancy

0 1970 Germany 252.311 70.6

1 1970 France 192.143 72.2

2 1970 Great Britain 123.993 71.9

3 1970 Japan 150.437 72.0

4 1970 USA 326.961 70.9

In [4]: import numpy as np

In [6]: df.cov()

/tmp/ipykernel_830/1545644723.py:1: FutureWarning: The default value of numeric


_only in DataFrame.cov is deprecated. In a future version, it will default to F
alse. Select only valid columns or specify the value of numeric_only to silence
this warning.
df.cov()
Out[6]: Year Spending_USD Life_Expectancy

Year 201.098848 2.571883e+04 41.915454

Spending_USD 25718.827373 4.817761e+06 4166.800912

Life_Expectancy 41.915454 4.166801e+03 10.733902

In [7]: #correlation-Pearson
df.corr(method='pearson')

/tmp/ipykernel_830/3272858879.py:2: FutureWarning: The default value of numeric


_only in DataFrame.corr is deprecated. In a future version, it will default to
False. Select only valid columns or specify the value of numeric_only to silenc
e this warning.
df.corr(method='pearson')
Out[7]: Year Spending_USD Life_Expectancy

Year 1.000000 0.826273 0.902175

Spending_USD 0.826273 1.000000 0.579430

Life_Expectancy 0.902175 0.579430 1.000000

In [8]: # Spearmn Rank


df.corr(method='spearman')

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/Covariance and correlation.ipynb 1/2


5/26/23, 9:29 PM Covariance and correlation

/tmp/ipykernel_830/1521701264.py:2: FutureWarning: The default value of numeric


_only in DataFrame.corr is deprecated. In a future version, it will default to
False. Select only valid columns or specify the value of numeric_only to silenc
e this warning.
df.corr(method='spearman')
Out[8]: Year Spending_USD Life_Expectancy

Year 1.000000 0.931598 0.896117

Spending_USD 0.931598 1.000000 0.747407

Life_Expectancy 0.896117 0.747407 1.000000

In [9]: df1=sns.load_dataset('flights')

In [10]: df1.cov()

/tmp/ipykernel_830/3142585312.py:1: FutureWarning: The default value of numeric


_only in DataFrame.cov is deprecated. In a future version, it will default to F
alse. Select only valid columns or specify the value of numeric_only to silence
this warning.
df1.cov()
Out[10]: year passengers

year 12.000000 383.087413

passengers 383.087413 14391.917201

https://green-florist-pbqnj.pwskills.app/lab/workspaces/auto-g/tree/work/Covariance and correlation.ipynb 2/2


5/26/23, 9:43 PM Notebooks

Assignment : Statistics Basic-1

Q1. What is Statistics?


Statistics is the science of collecting, organizing and
analyzing data.
Data means facts or peices of information.

eg: Weights of students at college=[56,50,60,48,54,68]

eg: IQ of Students=[70,100,80,90,60]

Statistics helps to understand the data and draw Conclusions/


Inferences.

It helps to visualize data and take decisions.

Q2. Define the different types of statistics and give


an example of when each type might be used.
Types of statistics are 1.Descriptive Statistics 2.Inferential
Statistics

Descriptive Statistics:
It consists of organizing and summarizng data.

Eg : What is the average height of the entire classroom?


-->Summation of the Heights / Number of Students in classroom.

Inferential Statistics
It consists of using the data you have measured to drawn
conclusions.

Eg : Are the heigths of students in the classroom similar to what


you expect in the entire college?
--> Here, we have to draw conclusions for population data based
on the results of sample data. Different techniques/ tests are
used for the same.

Q3. What are the different types of data and how


do they differ from each other? Provide an
example of each type of data.
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 1/7
5/26/23, 9:43 PM Notebooks

Types of Data:

1)Quantitative Data - 1.Discrete Data and 2. Continuous Data.

Discrete Data:
-->Whole numbers
-->Specific Range
Eg: No. of students in classroom, No. of Family Members, No. of
Vehicles on road,etc.

Continuous Data:
-->Any Value
Eg: Heights, Weight, Temprature, Volume, Speed, etc.

2)Qualitative Data - 1.Nominal and 2.Ordinal

Nominal Data:
--> No Ranks
Eg: Gender, Blood Group, Pincode, Favourite Colour,etc.

Ordinal Data:
--> Ranks
Eg: Marks of Students, Feedbacks,etc.

Q4. Categorise the following datasets with respect


to quantitative and qualitative data types:
(i) Grading in exam: A+, A, B+, B, C+, C, D, E : ORDINAL
DATA.

(ii) Colour of mangoes: yellow, green, orange, red : NOMINAL


DATA.

(iii) Height data of a class: [178.9, 179, 179.5, 176, 177.2, 178.3,
175.8,...] : CONTINUOUS DATA.

(iv) Number of mangoes exported by a farm: [500, 600, 478,


672, ...] : DISCRETE DATA.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 2/7


5/26/23, 9:43 PM Notebooks

Q5. Explain the concept of levels of measurement


and give an example of a variable for each level.
Quantitative/ Categorical Data:
eg: Gender, Colours, Labels.
ORDER DOES NOT MATTERS.

Ordinal Scale Data:


eg: Ranks in Apps. Feedback in Stars, etc
ORDER MATTERS.
RANKING IS IMPORTANT.
DIFFERENCE CANNOT BE MEASURED.

Interval Scale Data:


eg:Temprature=[30F, 60F, 90F, 120F]
ORDER MATTERS
DIFFERNCE AND RATIO CANNOT BE MEASURED

Ratio Scale Data.


eg: Marks of students
ORDER MATTERS.
DIFFERENCE AND RATIO CAN BE CALCULATED.

Q6. Why is it important to understand the level of


measurement when analyzing data? Provide an
example to illustrate your answer.
Understanding the level of measurement is crucial when analyzing data because it
determines the appropriate statistical techniques that can be applied and the
meaningfulness of the conclusions drawn from the analysis. The level of measurement
refers to the nature and properties of the data being collected, which can be classified into
four main levels: nominal, ordinal, interval, and ratio.

let's consider an example to illustrate the importance of understanding the level of


measurement. Suppose a researcher wants to analyze the happiness levels of participants
in a study. The researcher collects data by asking participants to rate their happiness on a
scale from 1 to 5 (ordinal level).

If the researcher mistakenly treats the ordinal data as interval data and performs
mathematical operations, such as taking the average of the happiness ratings, it would be
misleading. While averaging the ratings might provide a single value, it would not

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 3/7


5/26/23, 9:43 PM Notebooks

accurately represent the participants' true level of happiness since the intervals between
the categories are not equal. Instead, the researcher should use non-parametric statistical
tests suitable for ordinal data, such as the Mann-Whitney U test or the Kruskal-Wallis test,
which consider the ranking order of the categories.

By understanding the level of measurement, researchers can apply appropriate statistical


techniques, interpret the data correctly, and draw meaningful conclusions. Ignoring or
misinterpreting the level of measurement can lead to erroneous analyses and flawed
conclusions.

Q7. How nominal data type is different from


ordinal data type.
Nominal and ordinal data types are distinct in terms of their characteristics and the level of
information they convey. Here's a more detailed explanation of the differences between
nominal and ordinal data types:

Nominal Data:

--> No Ranks Eg: Gender, Blood Group, Pincode, Favourite Colour,etc.

Ordinal Data:

--> Ranks Eg: Marks of Students, Feedbacks,etc.

In summary, the main distinction between nominal and ordinal data lies in the nature of
the relationship between the categories. Nominal data consists of categories without any
order or ranking, while ordinal data has a natural order among the categories, although
the intervals between them may not be equal. Understanding these differences is crucial
when choosing appropriate statistical analyses and interpreting the results accurately.

Q8. Which type of plot can be used to display data


in terms of range?
To display data in terms of range, a common type of plot that can be used is a box plot,
also known as a box-and-whisker plot.

A box plot provides a visual representation of the minimum, maximum, median, and
quartiles of a dataset.

It effectively shows the range and distribution of the data.

Box:
The box represents the interquartile range (IQR), which spans the
middle 50% of the data. The bottom of the box corresponds to the
first quartile (Q1), and the top of the box corresponds to the
third quartile (Q3). The median is usually represented as a line
within the box.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 4/7


5/26/23, 9:43 PM Notebooks

Whiskers:
The whiskers extend from the box and represent the range of the
data, excluding outliers. The whiskers can be calculated in
different ways, such as extending to the minimum and maximum
values within a certain range or considering certain percentile
thresholds.

Outliers:
Individual data points that fall outside the whiskers are
considered outliers and are typically represented as individual
points or asterisks on the plot.

By using a box plot, you can quickly visualize the spread of the data, including the
minimum and maximum values, as well as the distribution across quartiles. It provides a
concise summary of the range and helps identify potential outliers or extreme values.

Box plots are especially useful when comparing multiple datasets or groups, as they allow
for easy visual comparison of the ranges and distributions across different categories or
variables.

Note that other types of plots, such as range plots or error bars, can also display data in
terms of range to some extent, but they may not provide as comprehensive information
about the quartiles and distribution as a box plot does.

Q9. Describe the difference between descriptive


and inferential statistics. Give an example of each
type of statistics and explain how they are used.
Descriptive statistics and inferential statistics serve different purposes in analyzing and
interpreting data. Descriptive statistics focus on summarizing and describing the main
characteristics of a dataset, providing insights into its patterns and attributes. Measures
such as the mean, median, range, and standard deviation are calculated to convey the
central tendency, dispersion, and spread of the data. For example, descriptive statistics can
be used to summarize the average age, the range of ages, and the distribution of ages in a
given dataset.

In contrast, inferential statistics involve drawing conclusions or making inferences about a


larger population based on a sample of data. These statistical techniques utilize probability
theory to assess the likelihood of certain outcomes and generalize findings beyond the
specific sample. Hypothesis testing, confidence intervals, and regression analysis are
common tools used in inferential statistics. For instance, inferential statistics can be applied
to determine if there is a significant difference in average income between two cities,
based on income data collected from samples in each city.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 5/7


5/26/23, 9:43 PM Notebooks

In summary, descriptive statistics provide a concise summary and description of data,


enabling a deeper understanding of its characteristics. Inferential statistics, on the other
hand, allow for generalizations and predictions about populations beyond the data at
hand, using statistical techniques to draw conclusions and make inferences. Together,
these two branches of statistics contribute to a comprehensive analysis and interpretation
of data.

Q10. What are some common measures of central


tendency and variability used in statistics? Explain
how each measure can be used to describe a dataset.

Measures of Central Tendency:

Mean:
The mean, or average, is the sum of all values in a dataset
divided by the number of observations. It represents the central
value around which the data points tend to cluster. The mean is
sensitive to extreme values and can be influenced by outliers.

Median:
The median is the middle value in a dataset when the observations
are arranged in ascending or descending order. It is less
affected by extreme values compared to the mean. The median
represents the central value that divides the dataset into two
equal halves.

Mode:
The mode represents the most frequently occurring value or values
in a dataset. It is useful for identifying the most common
observation or category in categorical data.

Measures of Variability:

Standard Deviation:
The standard deviation measures the average amount of deviation
or dispersion of data points from the mean. It quantifies the
spread of the dataset by considering the differences between each
value and the mean, taking into account the variability of the
entire dataset.

Variance:
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 6/7
5/26/23, 9:43 PM Notebooks

The variance is the average of the squared differences between


each data point and the mean. It provides a measure of the
variability by considering the spread of values around the mean.
The variance is directly related to the standard deviation, as it
is the square of the standard deviation.

THE END
In [ ]:

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment .ipynb 7/7


5/26/23, 9:44 PM Notebooks

Statistics Basic- Assignment 2

Q1. What are the three measures of central


tendency?
The 3 measures of central tendency are Mean, Median, Mode.

Mean (Average)- 1. Population Mean. and 2.Sample


Mean.
Population mean=
Eg: data={1,2,3,4,5,6,7,8,9,10}

Population mean = (1+2+3+4+5+6+7+8+9+10)/ 10


μ = 5.5

Sample mean= (1+3+5+7+9)/5


x̄ = 5

Median- The central element of the data is called


Median.
Steps: 1> If the no. of elements in data is odd the element at
the centre wil,be median.
Eg: m1={1,2,3,4,5}
median of m1 is 3.
2> If the no. of elements in the data is even,
then median will be the average of the 2 central
elements.
Eg: m2={1,2,3,4,5,6}
median of m2 = (3+4)/2 =7/2 = 3.5

Mode- It is the most recurring element of data.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 1/9


5/26/23, 9:44 PM Notebooks

M = {1,1,1,2,2,3,3,3,3,4,5,6}
mode of M = 3 (the most recurring element)

Q2. What is the difference between the


mean, median, and mode? How are they
used to measure the central tendency of a
dataset?
Mean is the average of the elements in the dataset. While, median is the central element in
the dataset (steps to find explained above). Mode is the element which occurs the most no.
of times.

To determine the appropriate measure of central tendency for a given dataset, consider the
type of data, the presence of outliers, and the shape of the distribution. For normally
distributed numerical data without outliers, the mean is often a good choice. When dealing
with skewed data or potential outliers, the median may provide a more representative
value. The mode is most applicable for categorical or nominal data, or when identifying the
most common value in a dataset.

It's important to note that while these measures provide information about central
tendency, they do not capture the entire picture of the dataset's distribution.

Q3. Measure the three measures of central


tendency for the given height data:
In [1]:
data=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
import numpy as np

In [3]:
np.mean(data)

Out[3]: 177.01875

In [4]:
np.median(data)

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 2/9


5/26/23, 9:44 PM Notebooks

Out[4]: 177.0

In [5]:
from scipy import stats

In [6]:
stats.mode(data)

/tmp/ipykernel_966/3267261142.py:1: FutureWarning: Unlike other reduction functions


(e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the a
xis it acts along. In SciPy 1.11.0, this behavior will change: the default value of
`keepdims` will become False, the `axis` over which the statistic is taken will be
eliminated, and the value None will no longer be accepted. Set `keepdims` to True o
r False to avoid this warning.
stats.mode(data)
Out[6]: ModeResult(mode=array([177.]), count=array([3]))

Q4. Find the standard deviation for the


given data:
In [7]:
data=[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [8]:
np.std(data)

Out[8]: 1.7885814036548633

Q5. How are measures of dispersion such


as range, variance, and standard deviation
used to describe the spread of a dataset?
Provide an example.

Variance
Variance measures the average squared deviation of each data point from the mean. It
quantifies the spread of the dataset by considering the differences between each value and
the mean. A higher variance indicates a greater dispersion.

To calculate the variance, you subtract the mean from each value, square the differences,
sum them up, and divide by the total number of values. For example, consider the
following dataset of exam scores: 65, 70, 75, 80, 85.

The mean is 75. The squared differences from the mean are: (65-75)^2, (70-75)^2, (75-
75)^2, (80-75)^2, (85-75)^2. Adding them up and dividing by 5 (the number of values)
gives you the variance. The variance provides a more comprehensive measure of
dispersion than the range but is influenced by the units of the data (since it involves
squaring the differences).

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 3/9


5/26/23, 9:44 PM Notebooks

In [9]:
exam_scores=[65, 70, 75, 80, 85]
np.mean(exam_scores)

Out[9]: 75.0

Standard Deviation
The standard deviation is the square root of the variance and is often used as a more
intuitive measure of dispersion. It measures the average amount by which the data points
deviate from the mean.

The standard deviation is calculated by taking the square root of the variance. Using the
same example of exam scores, once you have the variance, you can calculate the standard
deviation by taking the square root of the variance.

The standard deviation provides a more interpretable measure of spread since it is in the
same units as the original data. It is widely used in statistics and helps assess the variability
and consistency of the dataset.

In [10]:
np.std(exam_scores)

Out[10]: 7.0710678118654755

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 4/9


5/26/23, 9:44 PM Notebooks

Range
The range is the simplest measure of dispersion and represents the difference between the
largest and smallest values in a dataset. It gives an idea of the total spread of the data. For
example, if you have a dataset of exam scores: 65, 70, 75, 80, 85, the range would be 85 -
65 = 20. The range is easy to calculate but can be influenced by outliers and may not
provide a complete understanding of the distribution.

In [11]:
range= max(exam_scores) - min(exam_scores)
range

Out[11]: 20

Q6. What is a Venn diagram?


A Venn diagram is a visual representation of the relationships between different sets or
groups of items. It uses overlapping circles or other shapes to show the commonalities and
differences among the sets. The primary purpose of a Venn diagram is to illustrate the
logical relationships between sets and to provide a clear visual representation of the set
intersections and unions.

In a Venn diagram, each circle or shape represents a set, and the overlapping areas show
the elements that are shared between the sets. The non-overlapping areas represent the
unique elements of each set. The size of each circle does not indicate the size of the set; it
is used purely for visualization purposes.

Q7. For the two given sets A = (2,3,4,5,6,7) & B =


(0,2,6,8,10). Find:

(i) A intersection B

(ii) A ⋃ B
In [12]:
A = {2,3,4,5,6,7}
B = {0,2,6,8,10}

In [13]:
A & B
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 5/9
5/26/23, 9:44 PM Notebooks

Out[13]: {2, 6}

In [14]:
A|B

Out[14]: {0, 2, 3, 4, 5, 6, 7, 8, 10}

Q8. What do you understand about skewness in


data?
Skewness is a measure of the asymmetry or departure from symmetry in a probability
distribution or dataset. It provides information about the shape and distribution of the
data.

Positive Skewness (Right Skewness): In a positively skewed distribution, the tail on the right
side of the distribution is longer or more pronounced than the left tail. This means that the
majority of the data is concentrated on the left side of the distribution, while a few extreme
values are present on the right side.

Negative Skewness (Left Skewness): In a negatively skewed distribution, the tail on the left
side of the distribution is longer or more pronounced than the right tail. This indicates that
the majority of the data is concentrated on the right side of the distribution, while a few
extreme values exist on the left side.

Zero Skewness: A distribution with zero skewness is perfectly symmetric, where the left and
right tails are equally balanced.

Q9. If a data is right skewed then what will be the


position of median with respect to mean?

mean> median> mode.


If a data is right skewed, then the mean will be greater than the median. This is because
the mean is affected by outliers, while the median is not. In a right skewed distribution,
there will be a few very large values that pull the mean to the right. The median, on the
other hand, is not affected by outliers, so it will be closer to the center of the distribution.

In [15]:
snippet=[1, 2, 3, 4, 5,6, 7, 8, 9, 10, 100]

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 6/9


5/26/23, 9:44 PM Notebooks

In [16]:
ls=[1,2,3,4,5,6,7,8,9,10]

CHECKING MEAN

In [17]:
import numpy as np
np.mean(snippet)

Out[17]: 14.090909090909092

In [18]:
np.mean(ls)

Out[18]: 5.5

CHECKING MEDIAN

In [19]:
np.median(snippet)

Out[19]: 6.0

In [20]:
np.median(ls)

Out[20]: 5.5

the mean is pulled to the right by the outlier of 100. The median, on the other hand, is not
affected by the outlier, so it is closer to the center of the distribution.

Q10. Explain the difference between covariance


and correlation. How are these measures used in
statistical analysis?
Covariance and correlation are both measures of the relationship between two variables.
However, they measure different things.

Covariance measures the extent to which two variables vary together. It can be positive,
negative, or zero. A positive covariance indicates that the variables tend to move in the
same direction, while a negative covariance indicates that they tend to move in opposite
directions. A zero covariance indicates that there is no linear relationship between the
variables.

Correlation is a standardized measure of covariance. This means that it is not affected by


the scale of the variables. Correlation can range from -1 to +1. A correlation of +1
indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect
negative linear relationship, and a correlation of 0 indicates no linear relationship.
Covariance and correlation are both used in statistical analysis. Covariance is often used in
regression analysis, while correlation is often used in hypothesis testing and exploratory
data analysis.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 7/9


5/26/23, 9:44 PM Notebooks

Here are some examples of how covariance and correlation are used in statistical analysis:

Regression analysis is a statistical method that uses one or more independent variables to
predict a dependent variable. Covariance is often used in regression analysis to measure
the strength of the relationship between the independent and dependent variables.

Hypothesis testing is a statistical method that is used to determine whether there is a


significant difference between two or more groups. Correlation can be used in hypothesis
testing to determine whether there is a significant relationship between two variables.

Exploratory data analysis is a statistical method that is used to explore the data and to
identify patterns and relationships. Correlation can be used in exploratory data analysis to
identify relationships between variables.

Q11. What is the formula for calculating the sample


mean? Provide an example calculation for a
dataset.
n
∑ xi
– i=1
x =
n

Data :
,
x1 = 1 x2 = 2 x3 = 3 x4 = 4 , ,

– 1+2+3+4
x = = 2.5
4

Q12. For a normal distribution data what is the


relationship between its measure of central
tendency?
The relationship between the measures of central tendency in a normal distribution is that
they are all equal. This is because a normal distribution is symmetrical, meaning that the
mean, median, and mode are all located in the center of the distribution. As a result, they
all represent the same value.

Q13. How is covariance different from correlation?

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 8/9


5/26/23, 9:44 PM Notebooks

Covariance measures the extent to which two variables vary together. It can be positive,
negative, or zero. A positive covariance indicates that the variables tend to move in the
same direction, while a negative covariance indicates that they tend to move in opposite
directions. A zero covariance indicates that there is no linear relationship between the
variables.

Correlation is a standardized measure of covariance. This means that it is not affected by


the scale of the variables. Correlation can range from -1 to +1. A correlation of +1
indicates a perfect positive linear relationship, a correlation of -1 indicates a perfect
negative linear relationship, and a correlation of 0 indicates no linear relationship.

Q14. How do outliers affect measures of central


tendency and dispersion? Provide an example.
Outliers are data points that are very different from the rest of the data. They can affect
measures of central tendency and dispersion in a number of ways.

Mean : The mean is the average of all the data points. Outliers can cause the mean to
be pulled in the direction of the outlier. For example, if we have a dataset of heights
with an average of 5'8" and one outlier of 7'2", the mean will be 6'0".

Median : The median is the middle value of the data points, when they are arranged
in increasing or decreasing order. Outliers do not affect the median. For example, if we
have a dataset of heights with a median of 5'8" and one outlier of 7'2", the median will
still be 5'8".

Mode : The mode is the value that appears most often in the data set. Outliers can
affect the mode, but not always. For example, if we have a dataset of heights with a
mode of 5'8" and one outlier of 7'2", the mode may still be 5'8". However, if the outlier
is very different from the rest of the data, it may be the mode instead.

Range : The range is the difference between the largest and smallest values in the
data set. Outliers can increase the range. For example, if we have a dataset of heights
with a range of 6", and one outlier of 7'2", the range will be 72".

Standard deviation : The standard deviation is a measure of how spread out the data
is. Outliers can increase the standard deviation. For example, if we have a dataset of
heights with a standard deviation of 2", and one outlier of 7'2", the standard deviation
will be 3".

In general, outliers can make it more difficult to interpret data. It is important to identify
outliers and to decide whether or not to remove them from the data set.

THE END.
In [ ]:

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Basics Assignment 2 .ipynb 9/9


5/26/23, 9:44 PM Notebooks

Statistics Advance Assignment

Q1. What is the Probability density function?


Probability Density Function is a type of Probability
Distribution Function which denotes distribution of data.
It is for Continuous Random Variable. For Example: Height of
Students in class, Weight of students in a class, etc.

Q2. What are the types of Probability distribution?

Types of Probability Distribution are :


1.Normal/ Gaussian Distribution(pdf)

2.Bernoulli Distribution(pmf)

3.Binomial Distribution(pmf)

4.Poission Disttribution (pmf)

5.Log Normal Distribution(pdf)

6.Uniform Distibution(pmf)

Q3. Write a Python function to calculate the


probability density function of a normal
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 1/6
5/26/23, 9:44 PM Notebooks

distribution with given mean and standard


deviation at a given point.
In [1]:
import numpy as np

In [2]:
def normal_pdf(x, mean, std):

# Calculate the exponent.


exponent = (x - mean)**2 / (2 * std**2)

# Calculate the probability density function.


pdf = 1 / (std * np.sqrt(2 * np.pi)) * np.exp(-exponent)

return pdf

In [3]:
# Calculate the probability density function of a normal distribution with mean 0
pdf = normal_pdf(1, 0, 1)

# Print the probability density function.


print(pdf)

0.24197072451914337

In the example above, we calculated the probability density function of a normal


distribution with mean 0 and standard deviation 1 at the point 1. The output was
0.3989422804014327, which means that there is a 39.89422804014327% chance that a
random variable from this distribution will be at the point 1.

Q4. What are the properties of Binomial


distribution? Give two examples of events where
binomial distribution can be applied.
Binomial Distribution can also said as a group of Bernoulli Distribution.

* Outcome of every expirement is binary.


* This expirement is performed for n no. of trials.
* Sequence of outcome is called as Bernoulli Process.

Notation: B(n,p)
Parameters: n Belongs to {0,1,2,3,4...} No. of trials.
p belongs to {0,1} Success Probability for each trial.
q= 1-p

Examples:
Tossing a coin for 10 times,

rolling a dice for 5 times,etc.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 2/6


5/26/23, 9:44 PM Notebooks

Mean=np
Var= npq
Std= sqrt of Var.

Q5. Generate a random sample of size 1000 from a


binomial distribution with probability of success
0.4 and plot a histogram of the results using
matplotlib.
In [4]:
sample_size = 1000
probability_of_success = 0.4

# Generate the random sample


sample = np.random.binomial(1, probability_of_success, size=sample_size)

In [5]:
import matplotlib.pyplot as plt

In [6]:
plt.hist(sample, bins=2, range=[0, 1], edgecolor='black')
plt.xlabel('Success')
plt.ylabel('Frequency')
plt.title('Binomial Distribution')
plt.xticks([0, 1], ['Failure', 'Success'])
plt.show()

Q6. Write a Python function to calculate the


cumulative distribution function of a Poisson
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 3/6
5/26/23, 9:44 PM Notebooks

distribution with given mean at a given point.


In [24]:
import scipy.stats as stats

def poisson_cdf(mean, k):


cdf = stats.poisson.cdf(k, mu=mean)
return cdf

In [25]:
poisson_cdf(15,5)

Out[25]: 0.0027924293327009145

Q7. How Binomial distribution different from


Poisson distribution?

Binomial Distribution:
Binomial Distribution can also said as a group of Bernoulli
Distribution.

Outcome of every expirement is binary.


This expirement is performed for n no. of trials.
Sequence of outcome is called as Bernoulli Process.

Notation: B(n,p)

Parameters: n Belongs to {0,1,2,3,4...} No. of trials.

p belongs to {0,1} Success Probability for each trial.

q= 1-p

Examples:

Tossing a coin for 10 times,

rolling a dice for 5 times,etc.

Poisson Distribution:
Discrete random variable.
Describes the no. of events occuring in a fixed time interval.

Examples:

No. of people visting a hospital every hour.

No. of people visting a bank every hour.


https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 4/6
5/26/23, 9:44 PM Notebooks

No. of people visting a mall every hour.

Var = Mean = Expected no. of events to occur at every time interval * time interval

Q8. Generate a random sample of size 1000 from a


Poisson distribution with mean 5 and calculate the
sample mean and variance.
In [34]:
sample=np.random.poisson(lam=5,size=1000)

In [36]:
sample_mean = np.mean(sample)
sample_variance = np.var(sample)

# Print the results


print("Sample Mean:", sample_mean)
print("Sample Variance:", sample_variance)

Sample Mean: 4.958


Sample Variance: 4.866236

Q9. How mean and variance are related in Binomial


distribution and Poisson distribution?

In Poisson Distribution,

Mean = Variance = Expected no. of events to occur at every time


interval * time interval

In Binomial Distribution,

Mean=np
Var= npq

Where, n = no. of trials, p= success probability for each trial, q=


1-p

Q10. In normal distribution with respect to mean


position, where does the least frequent data
appear?

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 5/6


5/26/23, 9:44 PM Notebooks

In a normal distribution, the least frequent data appears in the tails of the distribution,
farther away from the mean.

The normal distribution is symmetric, with the mean located at the center. The probability
density function (PDF) of the normal distribution decreases gradually as you move away
from the mean in both directions.

As you move towards the tails of the distribution, the probability of observing data points
decreases. The data points located in the tails, which are farther away from the mean, are
less frequent compared to the data points closer to the mean.

Therefore, the least frequent data appears in the extreme ends of the distribution, in the
tails, while the most frequent data is concentrated around the mean.

The End
In [ ]:

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 1.ipynb 6/6


5/26/23, 9:44 PM Notebooks

Statistics Advance Assignment 2

Q1: What are the Probability Mass Function (PMF)


and Probability Density Function (PDF)? Explain
with an example.
Probability Distribution Function denotes distribution of data.

Probability Mass Function (PMF) and Probability Density Function (PDF) are types of
Probability Distribution Function.

Probability Density Function (PDF)


Continuous Random Variable

Eg: Height of students in a class, Weight of students in a class. Probability density ranges
between 0-1.

Probability Mass Function (PMF)


Discrete Random Variable

Eg: Rolling a Dice.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 1/8


5/26/23, 9:44 PM Notebooks

Q2: What is Cumulative Density Function (CDF)?


Explain with an example. Why CDF is used?
In probability theory and statistics, the cumulative distribution function (CDF) of a real-
valued random variable X, or just distribution function of X, evaluated at x, is the
probability that X will take a value less than or equal to x.

The CDF is a monotonically increasing function, meaning that it is always increasing from
left to right. The CDF is also continuous from the right, meaning that it is continuous at all
points except for its endpoint.

Q3: What are some examples of situations where


the normal distribution might be used as a model?
Explain how the parameters of the normal
distribution relate to the shape of the distribution.
Examples of Normal / Gaussian Distribution include:

Intelligence Quotient
Ages
Weights
Hieghts and much more...

Variance is directly propotional to Spread/ Gradient.

The shape of the normal distribution is determined by two parameters: the mean (μ) and
the standard deviation (σ).

Mean (μ): The mean determines the center or peak of the distribution. It represents
the average value around which the data cluster. Shifting the mean to the right or left
will shift the entire distribution accordingly.

Standard Deviation (σ): The standard deviation determines the spread or dispersion of
the distribution. A larger standard deviation results in a wider and flatter distribution,
indicating more variability in the data. Conversely, a smaller standard deviation leads
to a narrower and taller distribution, indicating less variability.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 2/8


5/26/23, 9:44 PM Notebooks

Together, the mean and standard deviation uniquely define the shape of the normal
distribution. Altering these parameters will shift the distribution horizontally or vertically
while preserving its characteristic bell shape.

Q4: Explain the importance of Normal Distribution.


Give a few real-life examples of Normal
Distribution.

Examples of Normal / Gaussian Distribution include:


Intelligence Quotient
Ages
Weights
Financial Markets
Hieghts and much more...

Central Limit Theorem: The normal distribution plays a fundamental role in the Central
Limit Theorem (CLT).

Data Modeling: The normal distribution provides a useful framework for modeling and
analyzing continuous data in many real-world scenarios.

Statistical Inference: Many statistical techniques and hypothesis tests rely on the
assumption of normality.

Quality Control: In manufacturing and quality control processes, the normal distribution is
often used to monitor and assess product quality.

Q5: What is Bernaulli Distribution? Give an


Example. What is the difference between Bernoulli
Distribution and Binomial Distribution?

Bernaulli Distribution
In probability theory and statistics, the Bernoulli distribution, named after Swiss
mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable
which takes the value 1 with probability p and the value 0 with probability q= 1-p. Less
formally, it can be thought of as a model for the set of possible outcomes of any single
experiment that asks a yes–no question.

Outcomes are binary


Discrete Random Variable (PMF)

Example : Tossing a coin, To pass or fail in examination.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 3/8


5/26/23, 9:44 PM Notebooks

Binomial Distribution.
Binomial Distribution can also said as a group of Bernoulli Distribution.

Outcome of every expirement is binary.

This expirement is performed for n no. of trials.

Sequence of outcome is called as Bernoulli Process.

Notation: B(n,p) Parameters: n Belongs to {0,1,2,3,4...} No. of trials. p belongs to {0,1}


Success Probability for each trial. q= 1-p

Examples:

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 4/8


5/26/23, 9:44 PM Notebooks

Tossing a coin for 10 times,

rolling a dice for 5 times,etc.

Q6. Consider a dataset with a mean of 50 and a


standard deviation of 10. If we assume that the
dataset is normally distributed, what is the
probability that a randomly selected observation
will be greater than 60? Use the appropriate
formula and show your calculations.

Lets consider teh following dataset


In [2]:
Dataset={30,40,50,60,70}

The answer to this question can be found by using Z-score.

Z-score= (mean - xi )/ std

In [4]:
(60-50)/10

Out[4]: 1.0

Z-score of 1.0 is 0.5398

for > 60

In [1]:
1-0.5398

Out[1]: 0.46020000000000005

means there is 46.0200% chances that a randomly selected


observation will be greater than 60.

Q7: Explain uniform Distribution with an example.

1. Continuous Uniform Distribution.


Continuous Random Variable.
Symmetric Probability Distribution.
Arbitrary outcomes between certain boundries

Notation - U(a,b)

a - minimum value , b - maximum value

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 5/8


5/26/23, 9:44 PM Notebooks

neg infinity < a < b < infinity

Example:
The number of candies sold daily at a shop is uniformly distibuted with minimum of 10 and
maximum of 40

2. Discrete Uniform Distribution.


Finite no. of outcomes equally likely to happen.
Symmetric.

Example :
Rolling a Dice. Outcomes can be {1,2,3,4,5,6}

Q8: What is the z score? State the importance of


the z score.
In statistics, a z-score is a number that tells you how many standard deviations a specific
data point is away from the mean. Z-scores are used to compare data points that have
been measured on different scales. For example, you could use z-scores to compare the
heights of students in different grades, or the test scores of students in different schools.

To calculate a z-score, you first need to find the mean and standard deviation of the data
set. The mean is the average of the data points, and the standard deviation is a measure of
how spread out the data points are. Once you have the mean and standard deviation, you
can calculate the z-score for each data point using the following formula:

where:
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 6/8
5/26/23, 9:44 PM Notebooks

z is the z-score
x is the data point
μ is the mean
σ is the standard deviation

Z-scores are important because they allow you to compare data points that have been
measured on different scales. For example, if you have a data set of heights and a data set
of test scores, you can use z-scores to compare the heights of the students to the test
scores of the students. This is because z-scores are a measure of how far away a data point
is from the mean, regardless of the scale that the data was measured on.

Z-scores are also used in statistical tests to compare the means of two or more groups. For
example, you could use a z-test to compare the heights of students in different grades. The
z-test would calculate the z-scores for the heights of the students in each grade, and then
it would compare the z-scores to see if there is a significant difference between the heights
of the students in the different grades.

Q9: What is Central Limit Theorem? State the


significance of the Central Limit Theorem.
The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that when
independent random variables are summed or averaged, regardless of their underlying
distribution, the resulting distribution tends to be approximately normal, as the sample
size increases.

In other words, the CLT describes the behavior of sample means or sums from any
distribution as the sample size becomes large, and the resulting distribution tends to be
normal, regardless of the underlying distribution. This theorem is essential because it
allows us to make inferences about the population from a relatively small sample.

Significance of the Central Limit Theorem:


Sampling Distribution: The CLT provides a theoretical basis for sampling distributions,
which are used to estimate population parameters from a sample. The sampling
distribution can be used to determine the mean and variance of a sample, which can
be used to make inferences about the population mean and variance.

Inferential Statistics: The CLT provides the foundation for inferential statistics, which is
the process of making inferences about a population based on a sample. Inferential
statistics are widely used in many fields, including business, economics, biology, and
social sciences.

Hypothesis Testing: The CLT is crucial in hypothesis testing. By assuming the


population distribution to be normal, we can calculate the probability of obtaining a
particular sample mean or sum. This probability is used to make inferences about the
population mean or sum, and to test hypotheses about the population parameters.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 7/8


5/26/23, 9:44 PM Notebooks

Real-world Applications: The CLT has significant applications in the real world. It is
widely used in quality control, finance, engineering, and medical research, where it
helps in assessing the accuracy of measurements, estimating population parameters,
and determining sample sizes.

Q10: State the assumptions of the Central Limit


Theorem.
The Central Limit Theorem (CLT) relies on a set of assumptions to hold true in order to
apply its principles effectively. The assumptions of the Central Limit Theorem are as
follows:

Independent and Identically Distributed (IID) Variables: The random variables being
summed or averaged must be independent of each other. Additionally, they should be
drawn from the same underlying distribution, meaning they have identical probability
distributions.

Finite Variance: The variables must have a finite variance (a measure of the dispersion
of the data). If the variance is infinite or does not exist, the CLT may not hold.

Sample Size: The CLT is applicable as the sample size increases. As the sample size
grows larger, the approximation to a normal distribution becomes more accurate.

The End

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 2.ipynb 8/8


5/26/23, 9:45 PM Notebooks

Statistics Advance Assignment

Q1: What is Estimation Statistics? Explain point


estimate and interval estimate.
Defination: Estimation statistics is a branch of statistics that deals with estimating unknown
population parameters based on sample data. It involves using sample statistics to make
inferences about population parameters.

Estimates: It is observed numerical value used to estimate the unknown population


parameter.

Point Estimate: Single numerical value used to estimate population parameter. Example:
Sample mean is a point estimate of population mean.

Interval Estimate: Range of value used to estimate the unknown population parameter.
Interval estimates of population parameter are called as confindence intervals.

Q2. Write a Python function to estimate the


population mean using a sample mean and
standard deviation.
In [1]:
# Lets consider the following data:

sample_mean = 50
standard_deviation = 10
sample_size = 25
# Sigma = 5% = 0.5
#Confidnece_Interval=0.95

In [2]:
def estimate(sample_mean, standard_deviation,Sample_size ):
# assuming C.I as 95% Z-score will be 1.96
lower_confidnece_interval= sample_mean -(1.96 * (standard_deviation/ sample_s
upper_confidnece_interval= sample_mean +(1.96 * (standard_deviation/ sample_s

return "I am 95% confident that the mean lies between [{:.2f}, {:.2f}]".forma

In [3]:
estimate(50,10,25)
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 1/13
5/26/23, 9:45 PM Notebooks

Out[3]: 'I am 95% confident that the mean lies between [46.08, 53.92]'

Q3: What is Hypothesis testing? Why is it used?


State the importance of Hypothesis testing.

Defination:
Hypothesis testing is a statistical method used to make inferences and draw conclusions
about a population based on a sample of data. It involves formulating two competing
hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and conducting
statistical tests to determine the likelihood of the observed data supporting one
hypothesis over the other.

Uses:
Hypothesis testing is used to assess the validity of assumptions or claims about a
population parameter or the relationship between variables. It helps researchers and
analysts make evidence-based decisions by providing a systematic framework for
evaluating the significance and reliability of their findings. By testing hypotheses, we can
determine if there is sufficient evidence to support a particular claim or if the observed
results are simply due to chance.

Importance:
The importance of hypothesis testing lies in its ability to provide a scientific and objective
approach to decision-making. It helps in validating or refuting research hypotheses,
determining the effectiveness of treatments or interventions, and assessing the significance
of relationships or differences between variables. Hypothesis testing allows us to draw
meaningful conclusions from data, make informed decisions, and contribute to the
advancement of knowledge in various fields, including science, medicine, business, and
social sciences.

Q4. Create a hypothesis that states whether the


average weight of male college students is greater
than the average weight of female college
students.

Step 1 :
Null Hypotheisis [H0] :
The average weight of male college students is greater than the average weight of female
college students.

Step 2 :
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 2/13
5/26/23, 9:45 PM Notebooks

Alternate Hypotheisis[H1] :
The average weight of male college students is not greater than the average weight of
female college students.

Step 3 :
95 % Confindence Interval . C.I= 0.95 Sigma =0.5 1-0.5=0.9750 Z-score of 0.9750 is 1.96

Step 4 :
Statistical Analysis:
Statistical Analysis can be done using the following formula

Z-test=

Q5. Write a Python script to conduct a hypothesis


test on the difference between two population
means, given a sample from each population.
In [4]:
import numpy as np
from scipy.stats import t

# Sample data from population 1 and population 2


pop1 = np.random.normal(10, 5, size=100)
pop2 = np.random.normal(12, 5, size=100)

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 3/13


5/26/23, 9:45 PM Notebooks

# Calculate the sample means and sample standard deviations


mean1 = np.mean(pop1)
mean2 = np.mean(pop2)
std1 = np.std(pop1, ddof=1)
std2 = np.std(pop2, ddof=1)

# Calculate the test statistic and degrees of freedom


n1, n2 = len(pop1), len(pop2)
s_pooled = np.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2))
t_stat = (mean1 - mean2) / (s_pooled * np.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2

# Calculate the p-value


p_value = (1 - t.cdf(abs(t_stat), df)) * 2

# Set the significance level


alpha = 0.05

# Compare the p-value to the significance level and draw a conclusion


if p_value < alpha:
print(f'Test statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.4f}')
print('Reject the null hypothesis')
else:
print(f'Test statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.4f}')
print('Fail to reject the null hypothesis')

Test statistic: -2.77


p-value: 0.0062
Reject the null hypothesis

Q6: What is a null and alternative hypothesis? Give


some examples.

1] Null Hypothesis:
The null hypothesis (H0) is a statement of no effect, no difference, or no relationship
between variables. It assumes that any observed differences or relationships are due to
chance or random variation.

2] Alternative Hypothesis:
The alternative hypothesis (Ha or H1) is a statement that contradicts or negates the null
hypothesis. It suggests that there is an effect, a difference, or a relationship between
variables that is not due to chance.

3] Examples of Null and Alternative Hypotheses:


Example 1:

Null Hypothesis: The mean test scores of students who receive tutoring are the same as
those who do not receive tutoring.

Alternative Hypothesis: The mean test scores of students who receive tutoring are different
from those who do not receive tutoring.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 4/13


5/26/23, 9:45 PM Notebooks

Example 2:

Null Hypothesis: There is no relationship between job satisfaction and employee


productivity.

Alternative Hypothesis: There is a relationship between job satisfaction and employee


productivity.

Example 3:

Null Hypothesis: The marketing campaign did not lead to an increase in product sales.

Alternative Hypothesis: The marketing campaign led to an increase in product sales.

Q7: Write down the steps involved in hypothesis


testing.

First Step
Null Hypothesis (H0):

This states that there is no significant difference between the population parameter and
the hypothesized value.

Second Step
Alternative Hypothesis (H1):

This states that there is a significant difference between the population parameter and the
hypothesized value.

Third Step
Set the Significance Level (α):

The significance level, denoted by α, determines the probability of rejecting the null
hypothesis when it is actually true. Commonly used significance levels are 0.05 (5%) or 0.01
(1%).

Fourth Step
Conduct the Test:

Calculate the test statistic: For a z-test, it is the observed sample statistic minus the
hypothesized population parameter, divided by the standard error.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 5/13


5/26/23, 9:45 PM Notebooks

Determine the critical value: The critical value(s) is obtained from the z-table or a statistical
software based on the chosen significance level. Compare the test statistic with the critical
value(s): If the test statistic falls in the critical region (beyond the critical value(s)), the null
hypothesis is rejected. Otherwise, it is not rejected.

Fifth Step
Make a Conclusion:

Based on the comparison in step 3, either reject the null hypothesis or fail to reject it. If the
null hypothesis is rejected, it suggests that there is sufficient evidence to support the
alternative hypothesis. If the null hypothesis is not rejected, it means there is insufficient
evidence to support the alternative hypothesis.

Q8. Define p-value and explain its significance in


hypothesis testing.

Definition:
The p-value is a statistical measure that quantifies the strength of evidence against the null
hypothesis in hypothesis testing. It represents the probability of obtaining a test statistic as
extreme as, or more extreme than, the observed value, assuming that the null hypothesis is
true.

Significance in Hypothesis Testing:


The p-value plays a crucial role in hypothesis testing and is used to make decisions about
the null hypothesis. Its significance can be understood as follows:

If the p-value is less than the predetermined significance level (α), typically 0.05 or 0.01, it
suggests that the observed data is statistically significant. This means that the observed
result is unlikely to have occurred by chance alone, leading to the rejection of the null
hypothesis in favor of the alternative hypothesis.

If the p-value is greater than the significance level (α), it indicates that the observed data
does not provide strong evidence against the null hypothesis. In this case, the null
hypothesis is not rejected, and it is concluded that the observed result could plausibly
occur due to chance or random variability.

The smaller the p-value, the stronger the evidence against the null hypothesis. A very small
p-value indicates a low probability of observing the data if the null hypothesis is true,
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 6/13
5/26/23, 9:45 PM Notebooks

suggesting a stronger case for rejecting the null hypothesis.

Q9. Generate a Student's t-distribution plot using


Python's matplotlib library, with the degrees of
freedom parameter set to 10.
In [5]:
"""numpy (imported as np) is a library for numerical operations in Python.
matplotlib.pyplot (imported as plt) is a plotting library that allows us to creat
scipy.stats.t is a module from the SciPy library that provides functions related

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

"""The degrees of freedom (df) parameter determines the shape of the Student's t-
In this code, we set it to 10."""
df = 10

"""We use np.linspace to create an array of 100 equally spaced x values that span
The t.ppf function is used to calculate the percent point function (inverse cumul
for the given degrees of freedom."""
x = np.linspace(t.ppf(0.001, df), t.ppf(0.999, df), 100)

"""We use t.pdf to compute the probability density function (PDF) for the Student
given the x values and degrees of freedom."""
pdf = t.pdf(x, df)

# Plot the t-distribution


plt.plot(x, pdf, label=f"t-distribution (df={df})")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.title("Student's t-distribution")
plt.legend()
plt.grid(True)
plt.show()

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 7/13


5/26/23, 9:45 PM Notebooks

Q10. Write a Python program to calculate the two-


sample t-test for independent samples, given two
random samples of equal size and a null
hypothesis that the population means are equal.
In [6]:
#Null hypothesis[Ho]: the population means are equal.
#Alternate hypothesis[H1]: the population means are not equal.

# Significance Value = 0.05


import scipy.stats as stat
#Assume the following Sample data:
import pandas as pd
df=pd.DataFrame({"Data":['mean','std','n'],"A": [1.3,0.5,22],"B":[1.6,0.3,22]})

from math import sqrt

t=(1.3-1.6)/sqrt((0.25/22)+(0.09/22))

dof = 22+22-2

p_value = 2 * (1 - stat.t.cdf(abs(t), dof))

print ( t, p_value )

-2.4131989996195315 0.020254894248790345

Q11: What is Student’s t distribution? When to use


the t-Distribution.
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 8/13
5/26/23, 9:45 PM Notebooks

Student's t-distribution is a probability distribution used when estimating the mean of a


normally distributed population with a small sample size or an unknown population
standard deviation.

The t-distribution is used in hypothesis testing and constructing confidence intervals when
the sample size is small (typically less than 30) and the population standard deviation is
unknown, or when the population is approximately normally distributed or the sample size
is large enough to apply the Central Limit Theorem. It provides critical values and
probabilities for t-tests, which compare sample means and assess the significance of the
difference between them.

Q12: What is t-statistic? State the formula for t-


statistic.
The t-statistic is a measure used in hypothesis testing to determine the significance of the
difference between sample means or the relationship between a sample mean and a
population mean. It quantifies the difference between the sample statistic (e.g., sample
mean) and the hypothesized population parameter (e.g., population mean) relative to the
variability in the data.

The formula for the t-statistic is:

t = (sample statistic - hypothesized parameter) / standard error

In this formula, the sample statistic represents the observed value from the sample data
(e.g., sample mean), the hypothesized parameter is the value assumed under the null
hypothesis (e.g., population mean), and the standard error measures the variability or
uncertainty in the sample statistic. The t-statistic indicates how many standard errors the
sample statistic deviates from the hypothesized parameter. By comparing the calculated t-
value to critical values from the t-distribution, we can determine the statistical significance
of the observed difference or relationship.

Q13. A coffee shop owner wants to estimate the


average daily revenue for their shop. They take a
random sample of 50 days and find the sample
mean revenue to be 500 dollor with a standard
deviation of 50 dollor. Estimate the population
mean revenue with a 95% confidence interval.
In [7]:
from math import sqrt

In [8]:
# Lets consider the following data.
n = 50
sample_mean = 500
std = 50
confidence_interval = 95
sigma = 0.05

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 9/13


5/26/23, 9:45 PM Notebooks

lower_ci= sample_mean - 1.96 * (std/(sqrt (n)))


upper_ci= sample_mean + 1.96 * (std/(sqrt (n)))

print("I am 95% confident that the population mean lies between",lower_ci ," and

I am 95% confident that the population mean lies between 486.1407070887437 and 51
3.8592929112564

Q14. A researcher hypothesizes that a new drug


will decrease blood pressure by 10 mmHg. They
conduct a clinical trial with 100 patients and find
that the sample mean decrease in blood pressure
is 8 mmHg with a standard deviation of 3 mmHg.
Test the hypothesis with a significance level of
0.05.
Null Hypothesis: New drug will decrease blood pressure by 10 mmHg.

Alternate Hypothesis: New drug will not decrease blood pressure by 10 mmHg.

Significance level : 0.05

1-0.05/2 =0.9750

Note : Z-score of 0.9750 is 1.96"

Std = 3

n=100

Sample_mean= 8

In [20]:
from math import sqrt

In [10]:
Z_test= (8-10)/ (3/(sqrt(100) ))

In [11]:
Z_test

Out[11]: -6.666666666666667

In [12]:
if Z_test < -1.96:
print ("We reject the null hypothesis")

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 10/13


5/26/23, 9:45 PM Notebooks

elif Z_test > 1.96:


print ("We reject the null hypothesis")
else:
print("We fail to reject the null hypothesis")

We reject the null hypothesis

Q15. An electronics company produces a certain


type of product with a mean weight of 5 pounds
and a standard deviation of 0.5 pounds. A random
sample of 25 products is taken, and the sample
mean weight is found to be 4.8 pounds. Test the
hypothesis that the true mean weight of the
products is less than 5 pounds with a significance
level of 0.01.
Null Hypothesis: Mean weight is 5 pounds

Alternate Hypothesis: Mean weight is not 5 pounds

Significance level : 0.01 t-value = 2.797

Degree of freedom(df)= 25-1 = 24

In [13]:
t= (4.8-5) / (0.5 /sqrt(25))

In [14]:
t

Out[14]: -2.0000000000000018

In [15]:
if t <-2.797:
Print("We Reject the null hypothesis")
elif t > 2.7979:
print("We Reject the null hypothesis")
else:
print("We fail to reject the null hypothesis")

We fail to reject the null hypothesis

Conclusion: Mean weight is 5 pounds

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 11/13


5/26/23, 9:45 PM Notebooks

Q16. Two groups of students are given different


study materials to prepare for a test. The first
group (n1 = 30) has a mean score of 80 with a
standard deviation of 10, and the second group (n2
= 40) has a mean score of 75 with a standard
deviation of 8. Test the hypothesis that the
population means for the two groups are equal
with a significance level of 0.01.
Null Hypothesis: The population means for the two groups are equal.

Alternate Hypothesis: The population means for the two groups are not equal.

Significance level = 0.01

In [16]:
t=(80 - 75)/( sqrt(((10**2)/ 30)+((8**2)/40)))

In [17]:
t

Out[17]: 2.2511258444537408

For α = 0.01 (99% confidence level):

The critical value for a two-tailed test with df = 68 is approximately ±2.663.

In [19]:
if t > 2.663 :
print("We reject the Null Hypothesis")
elif t < -2.663:
print("We reject the Null Hypothesis")
else :
print ("We fail to reject the Null Hypothesis")

We fail to reject the Null Hypothesis

Conclusion : The population means for the two groups are


equal.

Q17. A marketing company wants to estimate the


average number of ads watched by viewers during
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 12/13
5/26/23, 9:45 PM Notebooks

a TV program. They take a random sample of 50


viewers and find that the sample mean is 4 with a
standard deviation of 1.5. Estimate the population
mean with a 99% confidence interval.
Null Hypothesis: The population mean is equal to sample mean (ie:4)

Alternate Hypothesis:The population mean is not equal to sample mean (ie:4)

Significance level : 0.01

1-0.01/2 =0.999

Note : Z-score of 0.99 is 2.680.

Std = 1.5, n=50, Sample_mean= 4

Standard error (SE) = s / √n SE = 1.5 / √50 SE ≈ 0.2121

In [23]:
Lower_Confidence_Interval = 4 - (2.680 * 0.2121)
Upper_Confidence_Interval = 4 + 0.5685

In [33]:
print("I am 99% confident that the population mean lies between ", Lower_Confiden

I am 99% confident that the population mean lies between 3.431572 and 4.5685

THE END

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 3.ipynb 13/13


5/26/23, 9:51 PM Notebooks

Statistics Advance Assignment 4


Q1: What is the difference between a t-test and a z-test?
Provide an example scenario where you would use each type
of test.

Do you know the population standard deviation ?

if No:
then you should use T-test.

if Yes:
Is the sample size greater than 30?

if yes, Use Z-test


else use t-test.

Q2: Differentiate between one-tailed and two-tailed tests.


The main difference between one-tailed and two-tailed tests is that one-tailed tests only
have one critical region, while two-tailed tests have two critical regions.

In a one-tailed test, the critical region is located in one tail of the distribution, either the
left or the right. This means that the test is only looking for evidence of a difference in one
direction. For example, a one-tailed test could be used to determine whether a new drug is
more effective than a placebo in reducing pain. The critical region would be located in the
right tail of the distribution, because the researchers are only interested in finding evidence
that the drug is more effective than the placebo. For example: Does taking a new
medication improve the patient's condition?

In a two-tailed test, the critical region is located in both tails of the distribution. This means
that the test is looking for evidence of a difference in either direction. For example, a two-
tailed test could be used to determine whether there is a difference in the average height
of men and women. The critical region would be located in both the left and right tails of
the distribution, because the researchers are interested in finding evidence that men are
taller than women, or that women are taller than men.

For example:

Is there a difference in the average IQ of men and women?

Does taking a new medication improve the patient's condition, regardless of whether the
patient's condition is getting better or worse?

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 1/10


5/26/23, 9:51 PM Notebooks

Q3: Explain the concept of Type 1 and Type 2 errors in


hypothesis testing. Provide an example scenario for each
type of error.

Reality: Null Hypothesis is True or Null HYpothesis is False.

Decision: Null Hypothesis is True or Null hypothesis is False.

Outcomes:
Outcome 1: We reject the Null hypothesis when Null hypothesis is False in Reality.
{Good}

Exapmle: A researcher is interested in whether or not a new drug is effective in treating


depression. The researcher conducts a study and finds that the drug is effective in reducing
depressive symptoms. The researcher rejects the null hypothesis, which states that the drug is
not effective in treating depression. The researcher concludes that the drug is effective in
treating depression. The null hypothesis was false in reality. The drug was actually effective in
treating depression. The researcher correctly rejected the null hypothesis and concluded that
the drug is effective.

Outcome 2: We reject the Null hypothesis when Null hypothesis is True in Reality.
{Type 1 Error}

Example: A researcher conducts a study to test the effectiveness of a new drug. The
researcher rejects the null hypothesis and concludes that the drug is effective. However, the
drug is actually not effective and the researcher has made a Type I error.

Outcome 3: We accept the Null hypothesis when Null hypothesis is False in Reality.
{Type 2 Error}

Example:A researcher conducts a study to test the effectiveness of a new drug. The researcher
does not reject the null hypothesis and concludes that the drug is not effective. However, the
drug is actually effective and the researcher has made a Type II error.

Outcome 4 : We accept the Null hypothesis when Null hypothesis is True in Reality.
{Good}

Example: A researcher is interested in whether or not there is a difference in the average


height of men and women. The researcher conducts a study and finds that there is no
difference in the average height of men and women. The researcher accepts the null
hypothesis, which states that there is no difference in the average height of men and women.
The researcher concludes that there is no difference in the average height of men and
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 2/10
5/26/23, 9:51 PM Notebooks

women. The null hypothesis was true in reality. There was actually no difference in the
average height of men and women. The researcher correctly accepted the null hypothesis and
concluded that there is no difference in the average height of men and women.

Q4: Explain Bayes's theorem with an example.

P(A|B) = P(B|A) * P(A) / P(B)


where:

P(A|B) is the probability of event A occurring, given that event B has occurred
P(B|A) is the probability of event B occurring, given that event A has occurred
P(A) is the probability of event A occurring
P(B) is the probability of event B occurring

For example, let's say that you have a bag of marbles that contains 10 red marbles and 10
blue marbles. You reach into the bag and pull out a marble without looking. What is the
probability that the marble you pulled out is red?

Without any other information, we can say that the probability of pulling out a red marble
is 50%. This is because there are an equal number of red and blue marbles in the bag.

Now, let's say that you look at the marble and see that it is red. What is the probability that
the marble you pulled out is red now?

Using Bayes' theorem, we can update our probability to 75%. This is because we now know
that the marble is red, and we also know that there are more red marbles in the bag than
blue marbles.

Bayes' theorem can be used in a variety of applications, including medical diagnosis,


weather forecasting, and financial forecasting. It is a powerful tool that can be used to
make better decisions based on incomplete information.

Q5: What is a confidence interval? How to calculate the


confidence interval, explain with an example.
A confidence interval is a range of values that is likely to contain the true value of a
population parameter. The confidence level is the percentage of times that the confidence
interval will contain the true value of the population parameter.We construct Confidence
Interval to help estimate what the value of unknown population mean is.

Formula : CI = x̄ ± zα/2 * σ / √n

where:

CI is the confidence interval


x̄ is the sample mean
σ is the population standard deviation
zα/2 is the z-score for the desired confidence level
n is the sample size
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 3/10
5/26/23, 9:51 PM Notebooks

For example, a 95% confidence interval means that there is a 95% chance that the
confidence interval will contain the true population mean.

For example, let's say that we want to calculate a 95% confidence interval for the average
height of men. We know that the sample mean is 68 inches, the sample standard deviation
is 2 inches, and the sample size is 100.

In [4]:
from math import sqrt
# CI = 68 ± 1.96 * 2 / √100
Lower_CI=68 - 1.96 * 2 / sqrt(100)
Upper_CI=68 + 1.96 * 2 / sqrt(100)
Lower_CI,Upper_CI

Out[4]: (67.608, 68.392)

This means that we can be 95% confident that the true average height of men is between
67.61 inches and 68.39 inches.

Q6. Use Bayes' Theorem to calculate the probability of an


event occurring given prior knowledge of the event's
probability and new evidence. Provide a sample problem and
solution.

Problem:
A certain disease affects 1 in every 1000 people. A test has been developed to detect the
disease, and it is known to have a 95% accuracy rate (i.e. if a person has the disease, the
test will correctly identify it 95% of the time, and if a person does not have the disease, the
test will correctly identify it as negative 95% of the time). If a randomly selected person
tests positive for the disease, what is the probability that they actually have the disease?

A: The person has the disease. B: The person tests positive for the disease.

hint: P(B) = 0.05185

P(A) = 1/1000

P(B|A) = 0.95

P(B|not A) = 0.05

We want to find: P(A|B) (the probability of the person having the disease given that they
tested positive)

Using Bayes' Theorem: P(A|B) = (P(B|A) * P(A)) / P(B)

In [13]:
probability_A_given_B=((0.95) *(1/1000)) / (0.05185)

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 4/10


5/26/23, 9:51 PM Notebooks

Therefore, if a randomly selected person tests positive for the disease, the probability that
they actually have the disease is approximately 1.83%.

Q7. Calculate the 95% confidence interval for a sample of


data with a mean of 50 and a standard deviation of 5.
Interpret the results.

CI = x̄ ± zα/2 * σ / √n
where:

CI is the confidence interval


x̄ is the sample mean
σ is the population standard deviation
zα/2 is the z-score for the desired confidence level
n is the sample size

In [24]:
# Assuming n = 1000
Lower_Ci= 50 - 1.96 * 5 / 1000
Upper_Ci= 50 + 1.96 * 5 / 1000
print("I am 95% confident that the mean lies between " + str(Lower_Ci) + " and "

I am 95% confident that the mean lies between 49.9902 and 50.0098.

Q8. What is the margin of error in a confidence interval? How


does sample size affect the margin of error? Provide an
example of a scenario where a larger sample size would
result in a smaller margin of error.
The margin of error is a measure of the uncertainty or variability associated with estimating
a population parameter based on a sample. In the context of a confidence interval, it
represents the range within which we expect the true population parameter to fall.

A larger sample size tends to result in a smaller margin of error. This is because as the
sample size increases, the standard error decreases. The standard error is inversely
proportional to the square root of the sample size. Therefore, a larger sample size leads to
a more precise estimate and a narrower confidence interval.

Margin_of_error= zα/2 * σ / √n

CI = x̄ ± Margin_of_error
Lets consider 2 scenarios:

n=100
n=1000

CI=95%, std=5 and sample_mean=50

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 5/10


5/26/23, 9:51 PM Notebooks

In [31]:
Margin_of_Error_1= 1.96 * 5 / 500
print("Result for Scenario 1 : " + str(Margin_of_Error_1))
Margin_of_Error_2= 1.96 * 5 / 100
print("Result for Scenario 2 : " + str(Margin_of_Error_2))

Result for Scenario 1 : 0.019600000000000003


Result for Scenario 2 : 0.098

Hence we can conclude that larger sample size would result


in a smaller margin of error.¶

Q9. Calculate the z-score for a data point with a value of 75,
a population mean of 70, and a population standard
deviation of 5. Interpret the results.

z = (x - μ) / σ
Where:

x is the data point value,


μ is the population mean, and
σ is the population standard deviation.

In [1]:
z=(75-70)/5

In [2]:
z

Out[2]: 1.0

This means that the Datapoint 75 is 1 standard deviation away from the mean.

Q10. In a study of the effectiveness of a new weight loss


drug, a sample of 50 participants lost an average of 6 pounds
with a standard deviation of 2.5 pounds. Conduct a
hypothesis test to determine if the drug is significantly
effective at a 95% confidence level using a t-test.

Null Hypothesis H0: μ = 0

Alternate Hypothesis H1: μ ≠ 0

Significance level : 0.05, Confidence Interval : 95%

Degree of freedom = 50-1 = 49

t-table value of 0.05 for 2 tail test is 2.045

Decision Rule : if the t-test is greater than 2.045 or lesser than -2.045, reject the null
hypothesis.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 6/10


5/26/23, 9:51 PM Notebooks

t = (x̄ - μ) / (s / √n)

In [2]:
from math import sqrt
t = (6 - 0) / (2.5 / sqrt(50))

In [4]:
if t > 2.045 :
print("We reject the Null Hypothesis")
elif t < -2.045:
print("We reject the Null Hypothesis")
else:
print("We fail to reject the Null Hypothesis ")

We reject the Null Hypothesis

Since the absolute value of the calculated t-statistic (16.949) is much


greater than the critical t-value (2.0096), we reject the null hypothesis.

Q11. In a survey of 500 people, 65% reported being satisfied


with their current job. Calculate the 95% confidence interval
for the true proportion of people who are satisfied with their
job.

CI = x̄ ± zα/2 * σ / √n
where:

CI is the confidence interval


x̄ is the sample mean
σ is the population standard deviation
zα/2 is the z-score for the desired confidence level
n is the sample size

In [6]:
Standard_Error = sqrt((0.65 * (1 - 0.65)) / 500)

In [7]:
lower_ci = 65 - 1.96* Standard_Error

In [8]:
upper_ci=65 + 1.96 * Standard_Error

In [13]:
print("I am 95% confident that the population mean lies between "+ str(lower_ci)+

I am 95% confident that the population mean lies between 64.95819177114491 and 65.0
4180822885509

Q12. A researcher is testing the effectiveness of two different


teaching methods on student performance. Sample A has a
mean score of 85 with a standard deviation of 6, while
sample B has a mean score of 82 with a standard deviation of
5. Conduct a hypothesis test to determine if the two teaching
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 7/10
5/26/23, 9:51 PM Notebooks

methods have a significant difference in student


performance using a t-test with a significance level of 0.01.

Null Hypothesis [H0] : μ1=μ2

Alternate Hypothesis [H1] : μ1≠μ2

Significance level : 0.01

assuming n to be 100
Degree of freedom : 100 + 100 -2 =198

The critical t-value for α/2 = 0.005 and df = 198 is approximately ±2.617.

Decision Rule : If the t-test is lesser than -2.617 or greater than +2.617, We reject the
Null hypothesis.

calculating t-statistics

In [1]:
from math import sqrt
t=(85-82)/ sqrt (((6**2)/100)+((5**2)/100) )

In [4]:
if t < -2.617:
print("We reject the Null hypothesis.")
elif t> +2.617:
print("We reject the Null hypothesis.")
else:
print("We fail to reject the Null hypothesis.")

We reject the Null hypothesis.

Conclusion:
Based on the given data and conducting the two-sample t-test, we have evidence to
suggest that there is a significant difference in student performance between the two
teaching methods at a significance level of 0.01.

Q13. A population has a mean of 60 and a standard deviation


of 8. A sample of 50 observations has a mean of 65. Calculate
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 8/10
5/26/23, 9:51 PM Notebooks

the 90% confidence interval for the true population mean.

Confidence Interval = x̄ ± zα/2 * σ / √n


where:

CI is the confidence interval


x̄ is the sample mean
σ is the population standard deviation
zα/2 is the z-score for the desired confidence level
n is the sample size

The z-score for a 90% confidence interval is 1.645.

In [5]:
Lower_ci= 65 -( 1.645* (8 /sqrt(50)))

In [6]:
Upper_ci= 65 +( 1.645* (8 /sqrt(50)))

In [9]:
print("I am 95% confident that the population mean lies between "+ str(Lower_ci)+

I am 95% confident that the population mean lies between 63.13889495191701 and 66.8
6110504808299

Q14. In a study of the effects of caffeine on reaction time, a


sample of 30 participants had an average reaction time of
0.25 seconds with a standard deviation of 0.05 seconds.
Conduct a hypothesis test to determine if the caffeine has a
significant effect on reaction time at a 90% confidence level
using a t-test.

Null Hypothesis [H0] : μ=0.25

Alternate Hypothesis [H1] : μ ≠ 0.25

Significance level : 0.10

assuming n to be 100
Degree of freedom : 30-1=29

The critical t-value for α/2 = 0.05 and df = 29 is approximately ± 2.045 .

Decision Rule : If the t-test is lesser than -2.045 or greater than + 2.045 , We reject the
Null hypothesis.

t = (sample mean - population mean) / (sample standard


deviation / sqrt(sample size))

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 9/10


5/26/23, 9:51 PM Notebooks

In [23]:
t = (0.25 - 0) / (0.05 / sqrt(30))

In [22]:
if t < -2.045:
print("We reject the Null hypothesis.")
elif t> +2.045:
print("We reject the Null hypothesis.")
else:
print("We fail to reject the Null hypothesis.")

We reject the Null hypothesis.

THE END

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 4.ipynb 10/10


5/26/23, 9:51 PM Notebooks

Statistics Advance Assignment 5


Q1. Calculate the 95% confidence interval for a sample of
data with a mean of 50 and a standard deviation of 5 using
Python. Interpret the results.

Confidence Interval = x̄ ± zα/2 * σ / √n


where:

CI is the confidence interval


x̄ is the sample mean
σ is the population standard deviation
zα/2 is the z-score for the desired confidence level
n is the sample size

Assuming n to be 100

Z-score of 95% Confidence interval is 1.96

In [1]:
from math import sqrt
Lower_ci= 50 - ( 1.96* (5/sqrt(100)))
Upper_ci= 50 + ( 1.96* (5/sqrt(100)))

In [2]:
print("I am 95% confident that the population mean lies between "+ str(Lower_ci)+

I am 95% confident that the population mean lies between 49.02 and 50.98

Q2. Conduct a chi-square goodness of fit test to determine if


the distribution of colors of M&Ms in a bag matches the
expected distribution of 20% blue, 20% orange, 20% green,
10% yellow, 10% red, and 20% brown. Use Python to perform
the test with a significance level of 0.05.
Lets consider the following as observed data:

In [3]:
import pandas as pd
df=pd.DataFrame({"Colours":["Blue","Orange","Green","Yellow","Red","Brown"],"Expe

In [4]:
sum(df['Observed_Data'])

Out[4]: 300

In [5]:
df['Expected_Data'] = (df['Expected_Data(in %)']/100 )*300

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 1/9


5/26/23, 9:51 PM Notebooks

In [6]:
df

Out[6]: Colours Expected_Data(in %) Observed_Data Expected_Data

0 Blue 20 45 60.0

1 Orange 20 55 60.0

2 Green 20 50 60.0

3 Yellow 10 30 30.0

4 Red 10 25 30.0

5 Brown 20 95 60.0

In [7]:
import scipy.stats as stat
chisquare_test_statistics,p_value=stat.chisquare(df['Observed_Data'],df['Expected

In [8]:
chisquare_test_statistics,p_value

Out[8]: (27.083333333333336, 5.4949987771074724e-05)

In [9]:
# find the critical value
significance_value=0.05
dof=len(df['Expected_Data']) -1
crtitcal_value= stat.chi2.ppf(1-significance_value,dof)

In [10]:
if chisquare_test_statistics > crtitcal_value:
print ("we reject the null hypothesis")
else :
print ("we fail to reject the null hypothesis")

we reject the null hypothesis

Conclusion: Expected data matches the Observed Data

Q3. Use Python to calculate the chi-square statistic and p-


value for a contingency table with the following data:
In [11]:
data=({"Outcome":["Outcome1","Outcome2","Outcome3"],"Group A":[20,10,15],"Group B

In [12]:
import pandas as pd
df1 = pd.DataFrame(data)
df1 = df1.set_index('Outcome')

In [13]:
df1

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 2/9


5/26/23, 9:51 PM Notebooks

Out[13]: Group A Group B

Outcome

Outcome1 20 15

Outcome2 10 25

Outcome3 15 20

In [14]:
from scipy.stats import chi2_contingency

# Define the contingency table


observed = [[20, 15],
[10, 25],
[15, 20]]

# Perform chi-square test


chi2, p_value, dof, expected = chi2_contingency(observed)

# Print the results


print("Chi-square statistic:", chi2)
print("p-value:", p_value)

Chi-square statistic: 5.833333333333334


p-value: 0.05411376622282158

Q4. A study of the prevalence of smoking in a population of


500 individuals found that 60 individuals smoked. Use
Python to calculate the 95% confidence interval for the true
proportion of individuals in the population who smoke.
In [15]:
import math

sample_size = 500
sample_proportion = 60 / sample_size
confidence_level = 0.95

Margin_of_error= zα/2 * σ / √n

CI = x̄ ± Margin_of_error

standard_error = sqrt((p * (1 - p)) / n)


In [16]:
import scipy.stats as stat
# Calculate the standard error
standard_error = math.sqrt((sample_proportion * (1 - sample_proportion)) / sample

In [17]:
# Calculate the margin of error
margin_of_error = stat.norm.ppf(1 - (1 - confidence_level) / 2) * standard_error

# Calculate the lower and upper bounds of the confidence interval


lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 3/9


5/26/23, 9:51 PM Notebooks

# Print the confidence interval


print("Confidence Interval: [{:.4f}, {:.4f}]".format(lower_bound, upper_bound))

Confidence Interval: [0.0915, 0.1485]

Q5. Calculate the 90% confidence interval for a sample of


data with a mean of 75 and a standard deviation of 12 using
Python. Interpret the results.
In [18]:
import scipy.stats as stats

confidence_level = 0.9

# Calculate the Z-score


z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Print the Z-score


print("Z-score for a 90% confidence interval:", z_score)

Z-score for a 90% confidence interval: 1.6448536269514722

In [19]:
from math import sqrt

mean=75
standard_deviation=12
confidence_level = 0.90
sample_size=100

margin_of_error = 1.65 * (standard_deviation/ sqrt(sample_size) )

In [20]:
lower_bound = mean - margin_of_error
upper_bound = mean + margin_of_error

In [21]:
print("Confidence Interval: [{:.4f}, {:.4f}]".format(lower_bound, upper_bound))

Confidence Interval: [73.0200, 76.9800]

Q6. Use Python to plot the chi-square distribution with 10


degrees of freedom. Label the axes and shade the area
corresponding to a chi-square statistic of 15.
In [22]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

In [23]:
df = 10

defines the degrees of freedom (df) for the chi-square distribution. In this case, we set it to
10.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 4/9


5/26/23, 9:51 PM Notebooks

In [24]:
x = np.linspace(0, 30, 500)

generates an array of 500 evenly spaced values ranging from 0 to 30. These values will be
used as the x-axis values for plotting the chi-square distribution.

In [25]:
pdf = stats.chi2.pdf(x, df)

This line calculates the probability density function (PDF) of the chi-square distribution
using the pdf() function from scipy.stats.chi2. It takes the array of x values (x) and the
degrees of freedom (df) as input and returns the corresponding PDF values.

In [26]:
plt.plot(x, pdf)
x_fill = np.linspace(15, 30, 500)
pdf_fill = stats.chi2.pdf(x_fill, df)
plt.fill_between(x_fill, pdf_fill, color='green', alpha=0.3)
plt.xlabel('Chi-square Statistic')
plt.ylabel('Probability Density Function (PDF)')
plt.show()

Q7. A random sample of 1000 people was asked if they


preferred Coke or Pepsi. Of the sample, 520 preferred Coke.
Calculate a 99% confidence interval for the true proportion
of people in the population who prefer Coke.
In [27]:
sample_size = 1000
sample_proportion = 520/1000
mean=520

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 5/9


5/26/23, 9:51 PM Notebooks

#Confidence_Interval is 99%
Sigma = 0.01

In [28]:
from math import sqrt
standard_error = sqrt((sample_proportion * (1 - sample_proportion)) / sample_size

In [29]:
Z_score = stats.norm.ppf(1 - (1 - 0.99) / 2)

In [30]:
margin_of_error = Z_score * standard_error

In [31]:
#Confidence_interval
lower_bound= mean - margin_of_error
upper_bound= mean + margin_of_error
lower_bound,upper_bound

Out[31]: (519.9593051576779, 520.0406948423221)

Conclusion:
This means that we are 99% confident that the true proportion of people in the population
who prefer Coke falls between approximately 519.9593 and 520.0407.

Q8. A researcher hypothesizes that a coin is biased towards


tails. They flip the coin 100 times and observe 45 tails.
Conduct a chi-square goodness of fit test to determine if the
observed frequencies match the expected frequencies of a
fair coin. Use a significance level of 0.05.
In [32]:
sample_size= 100
observed =[45,55]
expected = [50,50]
sigma = 0.05
chisquare_test_statistics,p_value=stat.chisquare(observed,expected)

In [33]:
chisquare_test_statistics,p_value

Out[33]: (1.0, 0.31731050786291115)

In [34]:
# find the critical value
significance_value=0.05
dof=len(expected) -1
crtitcal_value= stat.chi2.ppf(1-significance_value,dof)

In [35]:
if chisquare_test_statistics > crtitcal_value:
print ("we reject the null hypothesis")
else :
print ("we fail to reject the null hypothesis")

we fail to reject the null hypothesis

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 6/9


5/26/23, 9:51 PM Notebooks

Q9. A study was conducted to determine if there is an


association between smoking status (smoker or non-smoker)
and lung cancer diagnosis (yes or no). The results are shown
in the contingency table below. Conduct a chi-square test for
independence to determine if there is a significant
association between smoking status and lung cancer
diagnosis.Use a significance level of 0.05.
In [36]:
import pandas as pd

data = {
"Status": ["Smoker", "Non-smoker"],
"Lung Cancer Yes": [60, 30],
"Lung Cancer No": [140, 170]
}

df = pd.DataFrame(data)
df = df.set_index('Status')
df

Out[36]: Lung Cancer Yes Lung Cancer No

Status

Smoker 60 140

Non-smoker 30 170

In [37]:
from scipy.stats import chi2_contingency

# Define the contingency table


observed = [[60, 140],
[30, 170]]

# Perform chi-square test


chi2, p_value, dof, expected = chi2_contingency(observed)

# Print the results


print("Chi-square statistic:", chi2)
print("p-value:", p_value)

Chi-square statistic: 12.057347670250895


p-value: 0.0005158863863703744

In [39]:
alpha = 0.05

# Compare p-value with significance level to make a decision


if p_value < alpha:
print("Reject the null hypothesis. There is a significant association between
else:
print("Fail to reject the null hypothesis. There is no significant associatio

Reject the null hypothesis. There is a significant association between smoking stat
us and lung cancer diagnosis.

Q10. A study was conducted to determine if the proportion


of people who prefer milk chocolate, dark chocolate, or white
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 7/9
5/26/23, 9:51 PM Notebooks

chocolate is different in the U.S. versus the U.K. A random


sample of 500 people from the U.S. and a random sample of
500 people from the U.K. were surveyed. The results are
shown in the contingency table below. Conduct a chi-square
test for independence to determine if there is a significant
association between chocolate preference and country of
origin.Use a significance level of 0.01.

In [43]:
observed=[[200 ,150, 150],
[225, 175, 100]]

chi2, p_value,dof, expected = chi2_contingency(observed)

print ("Chi-square statistic:" ,chi2)


print("p_value :", p_value)

Chi-square statistic: 13.393665158371041


p_value : 0.0012348168997745918

In [46]:
alpha = 0.01

# Compare p-value with significance level to make a decision


if p_value < alpha:
print("Reject the null hypothesis.")
else :
print ("We Fail to Reject the Null Hypothesis.")

Reject the null hypothesis.

Q11. A random sample of 30 people was selected from a


population with an unknown mean and standard deviation.
The sample mean was found to be 72 and the sample
standard deviation was found to be 10. Conduct a hypothesis
test to determine if the population mean is significantly
different from 70. Use a significance level of 0.05.
In [56]:
import scipy.stats as stats

# Sample statistics
sample_mean = 72
sample_std = 10
sample_size = 30

# Null hypothesis: Population mean is 70


null_mean = 70

# Calculate the t-statistic


t_stat = (sample_mean - null_mean) / (sample_std / (sample_size ** 0.5))

# Calculate the p-value


p_value = stats.t.sf(abs(t_stat), sample_size - 1) * 2

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 8/9


5/26/23, 9:51 PM Notebooks

# Significance level
alpha = 0.05

# Compare p-value with significance level to make a decision


if p_value < alpha:
print("Reject the null hypothesis. The population mean is significantly diffe
else:
print("Fail to reject the null hypothesis.",
"There is not enough evidence to conclude that the population mean is s

# Print the t-statistic and p-value


print("t-statistic:", t_stat)
print("p-value:", p_value)

Fail to reject the null hypothesis. There is not enough evidence to conclude that t
he population mean is significantly different from 70.
t-statistic: 1.0954451150103321
p-value: 0.28233623728606977

THE END

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 5 .ipynb 9/9


5/26/23, 9:52 PM Notebooks

Statistics Advance Assignment-6


Q1. Explain the assumptions required to use ANOVA and
provide examples of violations that could impact the validity
of the results.

Assumptions:
1. Normality of Sampling Distribution: The sample mean is normally distributed. Example
of Violation: if the data is highly skewed to one side, it may violate the normality
assumption.

2. Absence of Outliers: Any Outliers present in the data should be removed. Example of
Violation:if the data contains outliers, it may violate the normality assumption.

3. Homogenity of Variance: Each of the population has same variance. (σ1** 2) = (σ2** 2)
= (σ3** 2) Population variance in different level of each independent variable are
equal. Example of Violation: if one group has much larger variability compared to the
others, the assumption is violated.

4. Sample are Independent and Random: Example of Violation: if the same individuals are
measured multiple times in different groups, the independence assumption is violated.

If these assumptions are satisfied we can apply ANOVA.

Q2. What are the three types of ANOVA, and in what


situations would each be used?
The 3 types of ANOVA are as follows:

1] One Way ANOVA : One factor with atleast 2 levels, these levels are
independent.
eg: Doctor wants to test a new medication to decreasen headache.They split the partipants in
3 condtions [10mg, 20mg, 30mg]. Doctor asks the patients to rate thier headache on a scale
of [1-10]. In this example, Medication is a factor and 3 conditions are 3 levels.

2] Repeated Measures ANOVA : One factor with atleast 2 levels but ythe
levels are dependent.
eg: Consider a factor Running and levels as Day1, Day2 and Day3.

3] Factorial ANOVA : Two or more factors each of which with atleast 2


levels. Levels can be dependent or independent.
eg: Consider a Data consisting a factor Running and levels as Day1, Day2 and Day3. and
Another factor Gender and levels as Male and Female.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 1/13


5/26/23, 9:52 PM Notebooks

Q3. What is the partitioning of variance in ANOVA, and why


is it important to understand this concept?
In ANOVA, the partitioning of variance means dividing the total variability in the data into
different parts. This helps us understand how much of the overall variation is due to the
differences between groups we are comparing.

There are three main parts:


Between-group variance: This measures the differences between the groups being
compared. It tells us how much the groups differ from each other because of the
factor we are studying.

Within-group variance: This measures the variation within each group. It represents
the random or unexplained differences within the groups that are not related to the
factor we are studying.

Total variance: This is the overall variability in the data, which includes both the
differences between groups and the differences within groups.

Understanding this concept is important because it helps us:


Identify the main sources of variation: We can figure out how much of the variation is
due to the factor we are interested in and how much is just random variation.

Assess the strength of the effect: We can see how strong the relationship is between
the factor we are studying and the outcome.

Test hypotheses and draw conclusions: We can determine if the differences between
groups are statistically significant.

Q4. How would you calculate the total sum of squares (SST),
explained sum of squares (SSE), and residual sum of squares
(SSR) in a one-way ANOVA using Python?
In [1]:
Group1=[23,25,18]
Group2=[29,19,21]
Group3=[35,17]

In [2]:
import numpy as np

In [3]:
mean1=np.mean(Group1)
mean2=np.mean(Group2)
mean3=np.mean(Group3)

In [4]:
Grand_mean = (np.sum(Group1) + np.sum (Group2) + np.sum (Group3)) / ((len(Group1)

In [5]:
Grand_mean
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 2/13
5/26/23, 9:52 PM Notebooks

Out[5]: 23.375

Sum of squares within (SSW):

For each subject, compute the difference between its score and its group mean. You
thus have to compute each of the group means, and compute the difference between
each of the scores and the group mean to which that score belongs
Square all these differences
Sum the squared differences

In [6]:
sum1=[]
sum2=[]
sum3=[]
for i in Group1:
sum1.append((i - mean1)**2)
for i in Group2:
sum2.append((i - mean2)**2)
for i in Group3:
sum3.append((i - mean3)**2)

In [7]:
SSW=np.sum(sum1) + np.sum (sum2) + np.sum(sum3)
SSW

Out[7]: 244.0

Sum of squares between (SSB):

For each subject, compute the difference between its group mean and the grand
mean. The grand mean is the mean of all N scores (just sum all scores and divide by
the total sample size )
Square all these differences
Sum the squared differences

In [8]:
ssb1=[]
ssb2=[]
ssb3=[]
for i in Group1:
ssb1.append((mean1- Grand_mean)**2)
for i in Group2:
ssb2.append((mean2- Grand_mean)**2)
for i in Group3:
ssb3.append((mean3- Grand_mean)**2)

In [9]:
SSB=np.sum(ssb1) + np.sum (ssb2) + np.sum(ssb3)

In [10]:
SSB

Out[10]: 19.875

Sum of squares total (SST):

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 3/13


5/26/23, 9:52 PM Notebooks

For each subject, compute the difference between its score and the grand mean
Square all these differences
Sum the squared differences

In [11]:
sst1=[]
sst2=[]
sst3=[]
for i in Group1:
sst1.append((i- Grand_mean)**2)
for i in Group2:
sst2.append((i- Grand_mean)**2)
for i in Group3:
sst3.append((i- Grand_mean)**2)

In [12]:
SST= np.sum(sst1) + np.sum (sst2) + np.sum(sst3)

In [13]:
SST

Out[13]: 263.875

If you have computed two of the three sums of squares, you can easily computed the third
one by using the fact that SST = SSW + SSB.

Q5. In a two-way ANOVA, how would you calculate the main


effects and interaction effects using Python?
In [14]:
import numpy as np
import pandas as pd
from scipy import stats

In [15]:
# Create sample data for two factors (A and B) and the response variable (Y)
A = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
B = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

In [16]:
# Combine the data into a DataFrame
data = pd.DataFrame({'A': A, 'B': B, 'Y': Y})

In [17]:
data

Out[17]: A B Y

0 1 1 10

1 2 1 12

2 3 1 14

3 4 1 16

4 5 1 18

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 4/13


5/26/23, 9:52 PM Notebooks

A B Y

5 1 2 20

6 2 2 22

7 3 2 24

8 4 2 26

9 5 2 28

In [18]:
# Calculate the main effects and interaction effect
main_effect_A, pvalue_A = stats.f_oneway(data[data['B'] == 1]['Y'],
data[data['B'] == 2]['Y'])

main_effect_B, pvalue_B = stats.f_oneway(data[data['A'] == 1]['Y'],


data[data['A'] == 5]['Y'])

interaction_effect, pvalue_interaction = stats.f_oneway(data[data['A'] == 1]['Y']


data[data['A'] == 5]['Y']
data[data['A'] == 2]['Y']
data[data['A'] == 4]['Y']

print("Main Effect A:", main_effect_A)


print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)

Main Effect A: 25.0


Main Effect B: 1.28
Interaction Effect: 0.5333333333333333

Q6. Suppose you conducted a one-way ANOVA and obtained


an F-statistic of 5.23 and a p-value of 0.02. What can you
conclude about the differences between the groups, and how
would you interpret these results?

F-statistic:

The F-statistic (5.23 in this case) measures the extent of differences between the groups in
the one-way ANOVA. A higher F-statistic suggests that the differences between the groups
are more noticeable and significant. p-value:

P-value:

The p-value (0.02 in this case) provides a measure of the strength of evidence against the
null hypothesis. A p-value of 0.02 means that there is a 2% chance of obtaining such a
large F-statistic if there were actually no real differences between the groups.

Interpretation

Differences between the groups:

The obtained F-statistic of 5.23 indicates that there are noticeable differences between the
groups. These differences are not likely to occur by random chance alone. Statistical
significance:

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 5/13


5/26/23, 9:52 PM Notebooks

P-value:

The low p-value of 0.02 suggests strong evidence against the null hypothesis. It indicates
that the observed differences between the groups are unlikely to be due to random
variation alone.

Q7. In a repeated measures ANOVA, how would you handle


missing data, and what are the potential consequences of
using different methods to handle missing data?

Handling missing data:


Listwise deletion: Exclude any participant with missing data on any variable from the
analysis. It reduces the sample size but provides a complete case analysis.
Pairwise deletion: Use available data for each specific comparison. It retains more
participants but can result in different sample sizes for different comparisons.
Imputation: Fill in missing values with estimated values. It allows for all participants to
be included, but the accuracy of results depends on the chosen imputation method.

Potential consequences of different methods:


Listwise deletion: May introduce bias if missingness is related to the variables being
analyzed. It reduces the sample size and can limit generalizability.
Pairwise deletion: Retains more participants but can result in different sample sizes for
different comparisons, affecting precision and generalizability.
Imputation: Keeps all participants and maintains statistical power. However, the
accuracy of results depends on the chosen imputation method, and imputed values
introduce uncertainty.

Q8. What are some common post-hoc tests used after


ANOVA, and when would you use each one? Provide an
example of a situation where a post-hoc test might be
necessary.
After conducting an ANOVA with multiple groups, we often use post-hoc tests to compare
the groups and see which ones are different from each other. Some common post-hoc
tests are Tukey's HSD, Bonferroni correction, and Sidak correction. These tests help
determine which group means are significantly different from each other. For example, if
we conducted an ANOVA comparing the effectiveness of different treatments, a post-hoc
test would help us identify which treatments show significant differences in their
effectiveness.

Q9. A researcher wants to compare the mean weight loss of


three diets: A, B, and C. They collect data from 50
participants who were randomly assigned to one of the diets.
Conduct a one-way ANOVA using Python to determine if
there are any significant differences between the mean

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 6/13


5/26/23, 9:52 PM Notebooks

weight loss of the three diets. Report the F-statistic and p-


value, and interpret the results.
In [19]:
import pandas as pd
# Weight loss from each of the diet
A = [2, 4, 3, 5, 1, 2, 3, 4, 2, 3, 4, 5, 3, 2, 1, 2, 3, 4, 3, 5, 4, 2, 3, 1, 4,
3, 2, 3, 2, 4, 3, 2, 1, 2, 3, 4, 5, 3, 2, 1, 2, 3, 4, 3, 5, 4, 2, 3, 1,2]
B = [3, 1, 2, 1, 2, 3, 4, 3, 2, 3, 4, 5, 3, 2, 1, 2, 3, 4, 3, 5, 4, 2, 3, 1, 4,
3, 2, 3, 2, 4, 3, 2, 1, 2, 3, 4, 5, 3, 2, 1, 2, 3, 4, 3, 5, 4, 2, 3, 1,2]
C = [1, 2, 3, 4, 3, 5, 4, 2, 3, 1, 4, 3, 2, 3, 2, 4, 3, 2, 1, 2, 3, 4, 5, 3, 2,
1, 2, 3, 4, 3, 5, 4, 2, 3, 1, 2, 3, 4, 5, 3, 2, 1, 2, 3, 4, 3, 5, 4, 2,1]

Null Hypothesis [Ho] : µA = µB = µC

Alternate Hypothesis[H1] : µA is not equal to µB is not equal to µC.

Significance Value : α = 0.05 (Assuming)

Degree of freedom : N = 150, n=50, a=3

df between = a-1 =2
df within = N-a = 150 -3 = 147
df total = 150-1 = 149

In [20]:
import scipy.stats as stat
#Degrees of freedom
df_between = 2
df_within = 147
df_total = 149

# Significance level (alpha)


alpha = 0.05

# Calculate the critical value


critical_value = stat.f.ppf(1 - alpha, df_between, df_within)

# Print the critical value


print("Critical value:", critical_value)

Critical value: 3.057620651649394

In [21]:
mean1=np.mean(A)
mean2=np.mean(B)
mean3=np.mean(C)

In [22]:
Grand_mean = (np.sum(A) + np.sum (B) + np.sum (C)) / ((len(A)+len(B)+len(C)))

In [23]:
# Calculating Sum of Square Within
ssw1=[]
ssw2=[]
ssw3=[]
for i in A:
ssw1.append((i - mean1)**2)
for i in B:
ssw2.append((i - mean2)**2)

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 7/13


5/26/23, 9:52 PM Notebooks

for i in C:
ssw3.append((i - mean3)**2)

In [24]:
SSW=np.sum(ssw1) + np.sum (ssw2) + np.sum(ssw3)
SSW

Out[24]: 201.88

In [25]:
# Calculating Sum of Square Between
ssb1=[]
ssb2=[]
ssb3=[]
for i in A:
ssb1.append((mean1- Grand_mean)**2)
for i in B:
ssb2.append((mean2- Grand_mean)**2)
for i in C:
ssb3.append((mean3- Grand_mean)**2)
SSB=np.sum(ssb1) + np.sum (ssb2) + np.sum(ssb3)
SSB

Out[25]: 0.2800000000000005

In [26]:
# Calculating Sum of Square Total
sst1=[]
sst2=[]
sst3=[]
for i in A:
sst1.append((i- Grand_mean)**2)
for i in B:
sst2.append((i- Grand_mean)**2)
for i in C:
sst3.append((i- Grand_mean)**2)

In [27]:
SST= np.sum(sst1) + np.sum (sst2) + np.sum(sst3)

In [28]:
SST

Out[28]: 202.16000000000003

In [29]:
## Another method for SST
SSW + SSB

Out[29]: 202.16

In [30]:
# Mean of Squares

Ms_between = SSB/df_between

Ms_within = SSW/df_within

Ms_total = SST/df_total

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 8/13


5/26/23, 9:52 PM Notebooks

In [31]:
f = Ms_between / Ms_within

In [32]:
import scipy.stats as stats

# F-statistic and degrees of freedom


f_statistic = f # Replace with the actual calculated F-statistic
df_between = 2
df_within = 147

# Calculate the p-value


p_value = 1 - stats.f.cdf(f_statistic, df_between, df_within)

# Print the p-value


print("p-value:", p_value)

p-value: 0.903145943262158

In [33]:
if f > 3.057620651649394:
print ("Reject the Null Hypothesis.")
else :
print ("We Fail to Reject the Null Hypothesis.")

We Fail to Reject the Null Hypothesis.

The one-way ANOVA results indicate that we fail to reject the null hypothesis. This means
that there is not enough evidence to conclude that there are significant differences
between the mean weight loss of the three diets (A, B, and C).

Q10. A company wants to know if there are any significant


differences in the average time it takes to complete a task
using three different software programs: Program A,
Program B, and Program C. They randomly assign 30
employees to one of the programs and record the time it
takes each employee to complete the task. Conduct a two-
way ANOVA using Python to determine if there are any main
effects or interaction effects between the software programs
and employee experience level (novice vs. experienced).
Report the F-statistics and p-values, and interpret the results.
In [34]:
import pandas as pd

# Arrays for time, program, and experience


time = [12, 15, 18, 14, 16, 13, 11, 17, 20, 19, 14, 16, 13, 12, 15, 18, 14, 16, 1
program = ['A', 'B', 'C'] * 10
experience = ['Novice', 'Experienced'] * 15

# Create the DataFrame


data = pd.DataFrame({'Time': time, 'Program': program, 'Experience': experience})

# Display the DataFrame


data.head()

Out[34]: Time Program Experience

0 12 A Novice
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 9/13
5/26/23, 9:52 PM Notebooks

Time Program Experience

1 15 B Experienced

2 18 C Novice

3 14 A Experienced

4 16 B Novice

In [35]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [36]:
model = ols("Time ~ Program + Experience + Program:Experience", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

To interpret the results, focus on the p-values. A p-value less than the significance level (e.g.,
0.05) suggests that there is a significant effect.

If the p-value for the "Program" factor is significant, it indicates that there are
significant differences in the average time to complete the task among the three
software programs (A, B, and C).
If the p-value for the "Experience" factor is significant, it suggests that there are
significant differences in the average time to complete the task between novice and
experienced employees.
If the p-value for the interaction effect is significant, it implies that the effect of the
software program on the time to complete the task differs depending on the
employee's experience level.

In [37]:
print(anova_table)

# Example interpretation
if anova_table['PR(>F)']['Program'] < 0.05:
print("There is a significant difference in the average time to complete the

if anova_table['PR(>F)']['Experience'] < 0.05:


print("There is a significant difference in the average time to complete the

if anova_table['PR(>F)']['Program:Experience'] < 0.05:


print("There is a significant interaction effect between the software program

sum_sq df F PR(>F)
Program 1.866667 2.0 0.173375 0.841866
Experience 0.033333 1.0 0.006192 0.937932
Program:Experience 69.066667 2.0 6.414861 0.005863
Residual 129.200000 24.0 NaN NaN
There is a significant interaction effect between the software programs and employe
e experience level.

Q11. An educational researcher is interested in whether a new


teaching method improves student test scores. They
randomly assign 100 students to either the control group
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 10/13
5/26/23, 9:52 PM Notebooks

(traditional teaching method) or the experimental group


(new teaching method) and administer a test at the end of
the semester. Conduct a two-sample t-test using Python to
determine if there are any significant differences in test
scores between the two groups. If the results are significant,
follow up with a post-hoc test to determine which group(s)
differ significantly from each other.
In [38]:
import numpy as np
from scipy import stats

In [39]:
# Generate random test scores for the control and experimental groups
np.random.seed(42) # Set a seed for reproducibility
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)

In [40]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)
alpha = 0.05 # Significance level

In [41]:
print("Two-sample t-test results:")
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")

Two-sample t-test results:


t-statistic: -4.7547
p-value: 0.0000

In [42]:
if p_value < alpha:
print("There is a significant difference in test scores between the control a
else:
print("There is no significant difference in test scores between the control

There is a significant difference in test scores between the control and experiment
al groups.

In [43]:
df = pd.DataFrame({
'Test Scores': np.concatenate([control_scores, experimental_scores]),
'Group': np.concatenate([np.repeat('Control', len(control_scores)),
np.repeat('Experimental', len(experimental_scores))]
})

In [44]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(df['Test Scores'], df['Group'])


print(tukey_results)

Multiple Comparison of Means - Tukey HSD, FWER=0.05


========================================================
group1 group2 meandiff p-adj lower upper reject
--------------------------------------------------------
Control Experimental 6.2615 0.0 3.6645 8.8585 True
--------------------------------------------------------

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 11/13


5/26/23, 9:52 PM Notebooks

Q12. A researcher wants to know if there are any significant


differences in the average daily sales of three retail stores:
Store A, Store B, and Store C. They randomly select 30 days
and record the sales for each store on those days. Conduct a
repeated measures ANOVA using Python to determine if
there are any significant differences in sales between the
three stores. If the results are significant, follow up with a
post- hoc test to determine which store(s) differ significantly
from each other.
In [45]:
import numpy as np

store_a=np.random.randint(10000,25000,30)
store_b=np.random.randint(10000,20000,30)
store_c=np.random.randint(10000,30000,30)

In [46]:
# Create a dataframe with the sales data
df = pd.DataFrame({
'Day': list(range(1, 31)) * 3, # Day numbers
'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30, # Store labels
'Sales': [120, 135, 128, 110, 115, 130, 125, 132, 122, 130, 133, 140, 120, 12
150, 142, 138, 145, 148, 135, 138, 132, 130, 128, 120, 125, 170, 17
210, 198, 202, 190, 205, 192, 158, 160, 168, 155, 150, 165, 132, 13
125, 120, 122, 118, 115, 155, 152, 148, 150, 160, 158, 240, 235, 23
172, 180, 185, 188, 195, 168, 175, 172, 170, 162, 158, 198, 205, 21
})

In [47]:
print(df.head())

Day Store Sales


0 1 A 120
1 2 A 135
2 3 A 128
3 4 A 110
4 5 A 115

In [48]:
import statsmodels.api as sm

In [49]:
from statsmodels.formula.api import mixedlm

In [50]:
df['Store'] = df['Store'].astype('category')
df['Day'] = df['Day'].astype('category')

# Fit the repeated measures ANOVA model


model = mixedlm('Sales ~ Store', data=df, groups=df['Day'])
result = model.fit()

# Print the ANOVA table


print(result.summary())

Mixed Linear Model Regression Results


========================================================
Model: MixedLM Dependent Variable: Sales
No. Observations: 90 Method: REML
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 12/13
5/26/23, 9:52 PM Notebooks

No. Groups: 30 Scale: 708.5962


Min. group size: 3 Log-Likelihood: -415.4107
Max. group size: 3 Converged: Yes
Mean group size: 3.0
--------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept 129.400 4.939 26.200 0.000 119.720 139.080
Store[T.B] 29.033 6.873 4.224 0.000 15.562 42.504
Store[T.C] 63.033 6.873 9.171 0.000 49.562 76.504
Group Var 23.196 3.134
========================================================

Conclusion: results suggest that store type has a significant impact on


sales. Stores of type T.B and T.C have significantly higher sales than the
average store.

The End

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 6.ipynb 13/13


5/26/23, 9:52 PM Notebooks

Statistics Advance Assignment 7


Q1. Write a Python function that takes in two arrays of data
and calculates the F-value for a variance ratio test. The
function should return the F-value and the corresponding p-
value for the test.
In [1]:
import numpy as np
import scipy.stats as stat
from scipy.stats import f

def test (data1, data2):


f_test = np.var(data1)/np.var( data2)

df1=len(data1)-1
df2=len(data2)-1
p_value = f.sf(f_test, df1, df2)

print("f_test:", f_test, "p-value:", p_value)

In [2]:
test([10,20,34,56,78,64,98,56,53,75],[90,87,64,34,12,56,77,45])

f_test: 1.0224395952700231 p-value: 0.5000008288542184

Q2. Given a significance level of 0.05 and the degrees of


freedom for the numerator and denominator of an F-
distribution, write a Python function that returns the critical
F-value for a two-tailed test.
In [3]:
def test2 (df_num,df_den):
alpha = 0.05
critical_value = stat.f.ppf(1-alpha, dfn=df_num, dfd=df_den)
return critical_value

In [4]:
test2(23,29)

Out[4]: 1.9102874554747564

Q3. Write a Python program that generates random samples


from two normal distributions with known variances and
uses an F-test to determine if the variances are equal. The
program should output the F-value, degrees of freedom, and
p-value for the test.
In [5]:
import scipy.stats as stat
from scipy.stats import f
def test3(sample1, sample2) :

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 7.ipynb 1/5


5/26/23, 9:52 PM Notebooks

f_value = np.var(sample1)/ np.var (sample2)

df1 = len(sample1 ) - 1
df2 = len(sample2 ) - 1

p_value = f.sf( f_value, df1, df2 )

print ("f-value:", f_value ," df1:", df1, " df2: ", df2, " p-value: ", p_v

In [6]:
import numpy as np
sample1 = np.random.normal(size=30)
sample2 = np.random.normal(size=50)
test3(sample1, sample2)

f-value: 0.5820502838568864 df1: 29 df2: 49 p-value: 0.9393497806240769

Q4.The variances of two populations are known to be 10 and


15. A sample of 12 observations is taken from each
population. Conduct an F-test at the 5% significance level to
determine if the variances are significantly different.
In [7]:
var1=10
var2 = 15
df1= 12-1
df2= 12-1
alpha = 0.05

f_value = var1/ var2


critical_value=stat.f.ppf(q=1-alpha, dfn=df1, dfd=df2)

if f_value > critical_value:


print("There is a Significant difference.")
else :
print("There is no Significant difference.")

There is no Significant difference.

Q5. A manufacturer claims that the variance of the diameter


of a certain product is 0.005. A sample of 25 products is
taken, and the sample variance is found to be 0.006. Conduct
an F-test at the 1% significance level to determine if the claim
is justified.
In [8]:
# data from the question is as follows:
assumed_var = 0.005
sample_var = 0.006
n = 25
dfn = n-1
dfd = n-1
alpha = 0.01

import scipy.stats as stat

f_value = assumed_var/ sample_var


critical_value = stat.f.ppf (q = 1-alpha, dfn=dfn, dfd=dfd)

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 7.ipynb 2/5


5/26/23, 9:52 PM Notebooks

print("f_value:", f_value," critical_value:", critical_value)

# conclusion
if f_value > critical_value:
print("There is a Significant difference.")
else :
print("There is no Significant difference.")

f_value: 0.8333333333333334 critical_value: 2.659072104348157


There is no Significant difference.

Q6. Write a Python function that takes in the degrees of


freedom for the numerator and denominator of an F-
distribution and calculates the mean and variance of the
distribution. The function should return the mean and
variance as a tuple.
In [9]:
def test4 (dfn, dfd):

# calculate mean
if dfd > 2:
mean = dfd / (dfd - 2)
else:
mean = float('inf')

# calculate variance
if dfd> 4:
variance = (2 * (dfd** 2) * (dfn+ dfd- 2)) / (dfn* (dfd- 2) ** 2 * (
elif dfd<= 4 and dfd> 2:
variance = float('inf')
else:
variance = float('nan')

print ("mean=",(mean)," variance=",(variance))

In [10]:
test4(29,49)

mean= 1.0425531914893618 variance= 0.12659877998227384

Q7. A random sample of 10 measurements is taken from a


normal population with unknown variance. The sample
variance is found to be 25. Another random sample of 15
measurements is taken from another normal population with
unknown variance, and the sample variance is found to be
20. Conduct an F-test at the 10% significance level to
determine if the variances are significantly different.
In [11]:
# data from question
n1= 10
sample_variance1 = 25

n2= 15
sample_variance2 = 20
https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 7.ipynb 3/5
5/26/23, 9:52 PM Notebooks

#setting degrees of freedom


dfn = n1-1
dfd = n2-1

alpha = 0.10

#Performing tests

import scipy.stats as stat

f_value = sample_variance1 / sample_variance2


critical_value = stat.f.ppf (q = 1-alpha, dfn=dfn, dfd=dfd)

print("f_value:", f_value," critical_value:", critical_value)

# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")

f_value: 1.25 critical_value: 2.121954566976902


There is no Significant difference.

Q8. The following data represent the waiting times in


minutes at two different restaurants on a Saturday night:
Restaurant A: 24, 25, 28, 23, 22, 20, 27; Restaurant B: 31, 33,
35, 30, 32, 36. Conduct an F-test at the 5% significance level
to determine if the variances are significantly different.
In [12]:
# data from question
Restaurant_A=[ 24, 25, 28, 23, 22, 20, 27]
Restaurant_B=[ 31, 33, 35, 30, 32, 36]
alpha = 0.05

dfn = len(Restaurant_A)-1
dfd = len(Restaurant_B)-1

In [13]:
import numpy as np
import scipy.stats as stat
from scipy.stats import f

f_value = np.var(Restaurant_A)/ np.var(Restaurant_B)


critical_value = stat.f.ppf(q= 1-alpha,dfn=dfn, dfd=dfd)

print("f_value:", f_value," critical_value:", critical_value)

# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")

f_value: 1.496767651159843 critical_value: 4.950288068694318


There is no Significant difference.

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 7.ipynb 4/5


5/26/23, 9:52 PM Notebooks

Q9. The following data represent the test scores of two


groups of students: Group A: 80, 85, 90, 92, 87, 83; Group B:
75, 78, 82, 79, 81, 84. Conduct an F-test at the 1% significance
level to determine if the variances are significantly different.
In [14]:
Group_A = [80, 85, 90, 92, 87, 83]
Group_B = [75, 78, 82, 79, 81, 84]
df1 = len(Group_A)-1
df2 = len(Group_B)-1

alpha = 0.01

#calculations
f_value = np.var(Group_A) / np.var(Group_B)
critical_value = stat.f.ppf(q=1-alpha, dfn=df1, dfd=df2)

print("f_value:", f_value," critical_value:", critical_value)

# conclusion
if f_value > critical_value or f_value < (1 / critical_value):
print("There is a Significant difference.")
else :
print("There is no Significant difference.")

f_value: 1.9442622950819677 critical_value: 10.967020650907992


There is no Significant difference.

The End

https://github.com/Mahenoor-Merchant/Statistics/blob/main/Statistics Advance Assignment 7.ipynb 5/5

You might also like