

McCall Martin

Math 1040: Statistics


Skittles Class Project
Spring 2018

Introduction

The following data has been recorded by everyone in my Spring 2018 Statistics class. We each
bought a 2.17 ounce bag of Skittles and recorded the frequency of each color inside our bags. Each
classmate sent their information to our instructor, who then compiled the data and sent a copy
to the class. The goal of this project is to practice the things we are learning in class
and to get a hands-on perspective on collecting, organizing and drawing conclusions about our
data. The following pie and Pareto charts show the total number of red, orange,
yellow, green and purple candies across the whole class compared to my own bag of
Skittles. These graphs display the qualitative (categorical) data in the class data set. The
quantitative data graphs follow the pie and Pareto charts. These include a table of summary
statistics, a histogram and a box plot. The quantitative data is the count of candies in each
bag of Skittles.

Organizing and Displaying Categorical Data: Colors


Class Totals:
McCall's Bag:
While observing the class data, I noticed that the color totals are all very similar to each other. There
isn't much of an overall difference. Yellow had the lowest total in the class data, and it was
also the lowest color in my individual bag of candy, so my bag agrees with the class data on that
point. In my bag, the reds, oranges, greens and purples all had the same count. For the
class data, those colors were close to each other but still differed a bit. I had figured that some
bags would have a larger range of frequencies between candy colors, so I expected to see more of a
difference between the color counts.

Organizing and Displaying Quantitative Data: The Number of Candies per Bag

Quantitative Statistics on Number of Candies Per Bag

Column             n    Mean    Std. dev.   Min   Q1   Median   Q3   Max

Total Candy/Bag    34   60.00   3.38        45    59   61       62   64

I was a little surprised while I was observing the histogram and boxplot of the data for the 34
bags in the sample. I was really shocked to see an outlier (45) that was so different from the
rest of the data. The outlier stretches the lower tail, so the graphs are skewed left; this also shows
in the mean (60.00) being lower than the median (61). It was very interesting
to observe this data and I was a little surprised by the different totals in each bag. In my bag I had a
total of 59 candies, which was very close to the median of 61. Being so close to the median, I would
say that my number of candies agrees with the rest of the class's data.
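
The outlier can be checked with a short calculation. The sketch below assumes the common 1.5 × IQR
rule for flagging outliers; the report does not say which rule the graphing software used, so the
rule is an assumption. It uses only the five-number summary from the table above.

```python
# Five-number summary from the summary table (class data, n = 34 bags).
minimum, q1, median, q3, maximum = 45, 59, 61, 62, 64

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this are flagged (assumed 1.5 x IQR rule)
upper_fence = q3 + 1.5 * iqr   # values above this are flagged (assumed 1.5 x IQR rule)

print(f"IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"minimum {minimum} flagged as outlier: {minimum < lower_fence}")
print(f"maximum {maximum} flagged as outlier: {maximum > upper_fence}")
```

Under this rule the bag with 45 candies falls below the lower fence, while the largest bag (64) does not
fall above the upper fence, which fits a distribution that is skewed left.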

Reflection

I personally feel the graphs for the quantitative and qualitative data were extremely helpful in
creating a visual way to understand the different categories and their data sets. Categorical data,
also known as qualitative data, uses descriptions and labels. For this project, our categorical data was
the different colors of the Skittles candy. Graphs like pie charts and Pareto charts make sense for
labeling and organizing categorical data. A pie chart shows the relative frequencies, or percentages, that
make up the entire data set. In other words, pie charts help compare each category against the whole.
Bar graphs and Pareto charts show the categories on the x-axis and the frequency of each
category on the y-axis; a Pareto chart also orders the bars from most to least frequent. When dealing
with qualitative data and making pie charts or Pareto charts, we calculate frequencies as well as
relative frequencies. Averages, standard deviations and
other numerical summaries wouldn't make sense because you can't average categories such as colors.
Qualitative data deals with counts and frequencies, which determine percentages. The different
categories can also be labeled and colored in a nice, easy-to-read way that helps the viewer distinguish
between them. Graphs like boxplots or histograms don't make sense
for categorical data because they require a numerical scale, so they make more sense for
quantitative data.
Histograms are a nice way to organize quantitative data because each rectangle represents a
class, and its height shows the frequency of that class on the y-axis. They show the shape and patterns
in a data set. Other graphs that could be used for quantitative data are a stem and leaf plot, dot plot
and box plot. Although we didn't use a stem and leaf plot or dot plot for this particular project, they
are another way to show the frequencies in a data set while keeping the individual values visible. A box
plot is probably my favorite way to present quantitative data. Box plots are set up in a way that shows
the interquartile range and the outliers in the data set. Box plots show a 5-number summary, which
consists of the minimum, Q1, median, Q3 and maximum. These graphs don't make sense for qualitative
data because they depend on class widths and numerical values that can be meaningfully added and
subtracted. Quantitative data is where you can find averages, standard deviations and
other numerical measurements, which is helpful in statistics. Having these measurements
helps people read and understand data in a more precise way.

Confidence Interval Estimates

A confidence interval gives a range of plausible values for an unknown population parameter, based
on a sample. A point estimate, such as a sample mean or sample proportion, will almost never be exactly
equal to the true value, so there is a better chance of capturing the true value with a range of numbers
than with a single number. The interval runs from a lower bound to an upper bound, and the level of
confidence (for example 95% or 99%) describes how often intervals built this way would contain the true
value if the sampling were repeated many times. The remaining percentage is called alpha, so alpha
equals 1 minus the level of confidence; for a 99% level of confidence, alpha is 0.01.
Confidence intervals are a way to express the uncertainty in an estimate. They're a great way
to see how precise an estimated value may be.

Here, I constructed a 99% confidence interval estimate for the true proportion of yellow candies,
using the data from the entire class.
99% confidence interval results for the true proportion of yellow candies:
Sample   X     N      Sample p   99% CI

1        388   2040   0.1902     (0.1678, 0.2126)
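
The interval in this table can be reproduced with a short calculation. The sketch below assumes the usual
normal-approximation formula for a proportion, p ± z·sqrt(p(1 - p)/n); the function name
`proportion_ci` is made up for illustration, and the software used for the project may compute the
interval slightly differently.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n                                        # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical z value
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # margin of error
    return p_hat, p_hat - margin, p_hat + margin


# Class totals from the table: 388 yellow candies out of 2040.
p_hat, lower, upper = proportion_ci(388, 2040, 0.99)
print(f"sample p = {p_hat:.4f}, 99% CI = ({lower:.4f}, {upper:.4f})")
```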

I then constructed a 95% confidence interval estimate for the true mean number of candies per bag.
95% confidence interval results for the true mean of candies per bag:

Sample   x̅       N    DF   Sx̅     E      95% CI

1        60.00   34   33   0.58   1.18   (58.82, 61.18)
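
The numbers in this table can be checked from the summary statistics alone. In the sketch below, the
sample standard deviation (3.38) comes from the summary table earlier in the report, and the critical t
value of 2.0345 for 33 degrees of freedom is an assumption taken from a standard t table rather than
from the report itself.

```python
from math import sqrt

n = 34            # number of bags
mean = 60.00      # sample mean, candies per bag
s = 3.38          # sample standard deviation from the summary table
t_crit = 2.0345   # assumed t-table value: 95% confidence, df = n - 1 = 33

se = s / sqrt(n)        # standard error of the mean (Sx-bar in the table)
margin = t_crit * se    # margin of error (E in the table)
lower, upper = mean - margin, mean + margin

print(f"SE = {se:.2f}, E = {margin:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```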

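I am 99% confident that the true proportion of yellow candies falls between the lower bound,
0.1678, and the upper bound, 0.2126. The sample proportion, 0.1902, is the point estimate at the
center of the interval.
I am 95% confident that the true mean number of candies per bag lies between the lower bound,
58.82, and the upper bound, 61.18. The sample mean, 60.00, is the point estimate at the center of
the interval.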

Hypothesis Tests

Hypothesis testing is an extremely useful way to decide whether a claim about a population is
supported by sample data. A hypothesis in statistics is a statement regarding a characteristic
of a parameter. During a hypothesis test you start with the original hypothesis, or null hypothesis, then
state an alternative hypothesis. The test determines whether the sample evidence is strong enough to
reject the null hypothesis or whether you fail to reject it. Hypothesis testing is used throughout the
world in important matters such as legal decisions and scientific research, where an outcome has to
be decided based on evidence.

Use a 0.05 significance level to test the claim that 20% of all Skittles candies are red.
Test and CI for One Proportion (Red Skittles):
Ho: p = 0.2

H1: p ≠ 0.2

Sample X N Sample p 95% CI Zo

1 425 2040 0.208 (0.1826, 0.2174) 0.941

I used the classical approach for this hypothesis test. The classical approach compares the
test statistic (Zo) to the critical z values. The interval in the table, (0.1826, 0.2174), is centered on
the hypothesized value 0.2, and the sample proportion, 0.208, falls inside it, which agrees with the
result of the test. After making calculations, there was not sufficient evidence to reject the null
hypothesis. The test statistic, 0.941, was greater than the negative critical z value of –1.96 and less than
the positive critical z value of 1.96. Because the test statistic is not in the rejection region, we fail to
reject the null hypothesis: the data is consistent with the claim that 20% of all Skittles candies are red.
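
The test statistic and the decision can be reproduced with a short calculation. The sketch below assumes
the standard one-proportion z statistic, which uses the hypothesized proportion in the standard error;
the function name `one_proportion_z` is made up for illustration. It also computes a two-tailed P-value
as a cross-check on the classical approach.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """z statistic for testing H0: p = p0, with the standard error computed under H0."""
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se0


alpha = 0.05
z = one_proportion_z(425, 2040, 0.20)           # 425 red candies out of 2040
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tailed P-value
reject = abs(z) > z_crit

print(f"z = {z:.3f}, critical z = {z_crit:.2f}, P-value = {p_value:.3f}")
print("reject H0" if reject else "fail to reject H0")
```

Both approaches lead to the same decision here: the test statistic is inside the critical values and the
P-value is larger than alpha.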

Use a 0.01 significance level to test the claim that the mean number of candies in a bag of Skittles is 55.
One-Sample T: Candies Per Bag:
Test of Candy/Bag is 55.0
Ho: μ = 55

H1: μ ≠ 55
Variable    N    x̅       Sx     SE Mean   To

Candy/Bag   34   60.00   3.38   0.58      8.621

I used the classical approach for this hypothesis test as well. There was sufficient evidence to
reject the null hypothesis of μ = 55. It was rejected because the test statistic, 8.621, was greater than the
critical t value of 2.73 (two-tailed, α = 0.01, 33 degrees of freedom).
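
The t statistic can be checked from the summary statistics. In the sketch below, the critical value 2.733
(two-tailed, α = 0.01, 33 degrees of freedom) is an assumption taken from a standard t table. With the
unrounded standard error the statistic comes out near 8.63; the table value of 8.621 matches dividing
by the rounded standard error of 0.58.

```python
from math import sqrt

n = 34           # number of bags
mean = 60.00     # sample mean
s = 3.38         # sample standard deviation
mu0 = 55         # hypothesized mean under H0
t_crit = 2.733   # assumed t-table value: two-tailed, alpha = 0.01, df = 33

se = s / sqrt(n)             # standard error of the mean
t_stat = (mean - mu0) / se   # one-sample t statistic
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, critical t = {t_crit}")
print("reject H0" if reject else "fail to reject H0")
```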

Reflection

When constructing a confidence interval, it is important to first make sure that the sampling
distribution is approximately normal. When working with a mean, this means your sample size must be
greater than or equal to 30 (or the population itself must be normally distributed). When dealing with a
proportion you must make sure your sample size, n, multiplied
by your proportion, p, multiplied by (1 - p) is greater than or equal to 10; in other words, np(1 - p) ≥ 10.
Once this condition is met you can start the process of creating a confidence interval. Once
the level of confidence has been determined, the rest can be calculated. You need a sample
size, alpha and, for a mean, a standard deviation. There is a slight difference between finding the
confidence interval for a mean and for a proportion: a proportion uses a critical z value, while a mean
with an unknown population standard deviation uses a critical t value. Once you have all your data and
find your critical value you can determine the lower and upper bounds, essentially solving your problem.
When completing my confidence intervals for the Skittles data, my samples met the conditions needed:
there were 34 bags for the mean and 2040 candies for the proportion.
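
The two conditions can be written as quick checks. This is a minimal sketch using the class numbers from
the tables earlier in the report; the function names are made up for illustration, and the thresholds (30
and 10) are the ones stated in the paragraph above.

```python
def mean_condition_met(n, minimum_n=30):
    """Sample-size condition for a mean: n must be at least 30."""
    return n >= minimum_n


def proportion_condition_met(n, p, threshold=10):
    """Condition for a proportion: n * p * (1 - p) must be at least 10."""
    return n * p * (1 - p) >= threshold


bags = 34                  # sample size for the mean
candies = 2040             # sample size for the proportions
p_yellow = 388 / candies   # sample proportion of yellow
p_red_null = 0.20          # hypothesized proportion of red

print(mean_condition_met(bags))
print(round(candies * p_yellow * (1 - p_yellow), 1), proportion_condition_met(candies, p_yellow))
print(round(candies * p_red_null * (1 - p_red_null), 1), proportion_condition_met(candies, p_red_null))
```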
Potential fatal errors when constructing a confidence interval would be not checking that your
sample size is large enough, using the wrong data, or making a mistake in calculating and using the
formulas. Another big error would be choosing a critical z score when a critical t score is needed, or
the other way around.
In order for a hypothesis test to function correctly it is extremely important that you have a
value for alpha, the significance level. Without alpha there is no way a hypothesis test can be completed,
because alpha sets the cutoff for deciding whether to reject your null hypothesis. During a hypothesis
test you always state your null and your alternative hypothesis, which is the first step to completing a
test, aside from making sure your sample size is large enough. Once you have both your null and
alternative hypothesis, you need a point estimate from a sample of your population. Once you have all
the data needed to complete your test and you've figured out either your critical z or critical t from
alpha, you can determine whether to reject your null hypothesis. There are several ways of making that
decision. The first is the classical approach, comparing your
test statistic to your critical z or critical t score. The second approach is the P-value approach, where
you calculate the probability of getting a sample result at least as extreme as yours if the null
hypothesis were true, and reject the null hypothesis when that probability is less than alpha. Another
approach, for two-tailed tests, is testing with a confidence interval: you determine the lower and upper
bounds and see whether the hypothesized value falls between them.
My samples and data met the conditions needed for a hypothesis test. I had a value of alpha in
each test and was able to determine if I should reject or fail to reject the null hypothesis. There are two
types of errors in a hypothesis test, type 1 and type 2. A type 1 error happens when you reject the null
hypothesis when it is actually true. A type 2 error is when you fail to reject the null hypothesis when it
is actually false. Both errors can cause problems when made in the real world. If I did a problem
wrong and rejected the null, or failed to reject the null, when I shouldn't have, that would lead to a
serious error.
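
A small simulation can illustrate what a type 1 error looks like in practice. The sketch below is not
part of the class project: it assumes the null hypothesis is exactly true (20% of candies are red),
repeatedly draws simulated samples of 2040 candies, runs the same two-tailed z test at alpha = 0.05, and
counts how often the true null hypothesis gets rejected. The function name, the number of trials and the
random seed are all arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def type1_error_rate(p0=0.20, n=2040, alpha=0.05, trials=2000, seed=1):
    """Share of simulated samples in which a true H0: p = p0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se0 = sqrt(p0 * (1 - p0) / n)
    rejections = 0
    for _ in range(trials):
        # Simulate n candies, each red with probability p0 (H0 is true).
        reds = sum(rng.random() < p0 for _ in range(n))
        z = (reds / n - p0) / se0
        if abs(z) > z_crit:
            rejections += 1   # rejecting a true H0 is a type 1 error
    return rejections / trials


rate = type1_error_rate()
print(f"share of true null hypotheses rejected: {rate:.3f}")
```

The share of rejections should land close to alpha, which is the sense in which alpha is the probability
of making a type 1 error.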
For this particular project, an error could have occurred in a few different ways. For example, if
students miscalculated or didn't record the right data from their Skittles bag that could have thrown off
the data quite a bit. If a student received Skittles that were deformed, and they weren't sure whether
they should count that as a full candy, that could have also thrown off the data.
The sampling method could have been improved if we used a larger sample, for example by
having each student count a larger Skittles bag instead of the 2.17 oz size, or by including more bags.
When a sample size increases, estimates such as the sample proportions become more precise and the
confidence intervals become narrower. If we had larger bags, we would also have seen larger numbers of
each color per bag, which would make it easier to see real differences between the colors. It would have
been interesting to see the results from a larger sample.
I think this was a stupendous project and it really helped me understand the things I've been
learning in my statistics class with a hands-on approach. I loved being able to construct a confidence
interval and perform a hypothesis test with real data that I helped gather. It was
interesting to observe everything unfold. The class data gave reasonable results in both the
confidence intervals and the hypothesis tests. The tests performed and the data gathered throughout
the project show that there are different ways of looking at life.
