FIT1006Asst2 vn0b


Monash University Faculty of Information Technology 1st Semester 2024

FIT1006 Business Information Analysis

Assignment 2: Further Data Analysis, Estimation and Hypothesis Testing

This assignment is worth 40% of your final mark (subject to the hurdles described in the
FIT1006 handbook entry, FIT1006 Moodle preview [or Unit Guide] and links therein).
Among other things (see below), note the need to hit the `Submit’ button (and the possible
requirement of an interview).

Due Date: Wednesday 15th May 2024, 11:55 pm


Method of submission: Your submission should consist of 1 file:
1. A text-based .pdf file named as: FamilyName-StudentId-1stSem2024FIT1006Asst2.pdf .
The file must be uploaded on the FIT1006 Moodle site by the due date and time.
The text-based .pdf file will undergo a similarity check by Turnitin at the time you submit
to Moodle. If you have any relevant output from Microsoft Excel and/or from SYSTAT then
make sure to include that in the appropriate place(s) in your .pdf file.
Please read submission instructions here and elsewhere carefully regarding the use of
Moodle.

Total available marks: 28 + 10 + 7 + 25 + 30 = 100 marks.

Note 1: Please recall the support available via https://www.monash.edu/student-academic-success,
the Academic Integrity rules and the `Welcome to FIT1006’ post in Ed Discussion.
This is an individual assignment.
In submitting this assignment, you acknowledge both that you are familiar with the relevant
policies, rules and regulations regarding Academic Integrity (including, e.g., doing your own
work, not sharing your work, not using ChatGPT in particular, not using generative AI at all)
and also that you are familiar with the consequences of being deemed to be in contravention
of these policies.

Note 2: And a reminder not to post even part of a proposed partial solution to a forum or
other public location. This includes when you are seeking clarification of a question.
If you seek clarification on an Assignment question then – bearing in mind the above – word
your question very carefully and/or (if necessary) send private e-mail. If you are seeking to
understand a concept better, then try to word your question so that it is a long way removed
from the Assignment. You are reminded that this is probably the best path to a faster and
clearer answer (in addition to consultation sessions) without (e.g.) removal of your post. You
are also reminded that Monash University takes academic integrity very seriously.

Note 3: As previously advised, it is your responsibility to be familiar with the special
consideration policies and the process for applying, as well as with academic integrity.
Among other things, the FIT1006 teaching team cannot process your special consideration
application and is not even able to forward it on to the relevant people, so
sending such material to the FIT1006 teaching team simply runs the risk of missing the
deadline with Monash University’s special consideration team. And, if you apply for special
consideration (via the correct process) and are granted special consideration, then please
allow the FIT1006 teaching team 2 business days to process this.

Note 4: As a general rule, don’t just give a number or an answer like `Yes’ or `No’ without
at least some clear and sufficient explanation - or, otherwise, you risk being awarded 0
marks for the relevant exercise. Make it easy for the person/people marking your work to
follow your reasoning. Without clear explanation, there is the possibility that any such
exercise will be awarded 0 marks.
Re-iterating a point above, for each and every question, sub-question and exercise, clearly
state any assumptions, clearly explain your answer and clearly show any working.

Note 5: All of your submitted work should be in machine readable form, and none of your
submitted work should be hand-written.

Note 6: If you wish for your work to be marked and not to accrue (possibly considerable)
late penalties, then make sure to upload the correct files (and not to leave your files
as Draft). You then need to check that you have all files uploaded and that you are
ready to hit `Submit’. Once you hit `Submit’, you give consent for us to begin marking your
work. If you hit `Submit’ without all files uploaded then you will probably be deemed not to
have followed the instructions from the Notes above. If you leave your work as Draft and
have not hit `Submit’ then we have not received it, and it can accrue late penalties once the
deadline passes. In short, make sure to hit ‘Submit’ at the appropriate time to make sure that
your work is submitted. Late penalties will be as per Monash University Faculty of IT and
Monash University policies (see, e.g.,
https://publicpolicydms.monash.edu/Monash/documents/1935752 and, e.g., sec. 1.11). It is
expected that any work submitted at least 10 calendar days after the deadline will
automatically be given a mark of 0.

Note 7: Save your work regularly.

Some Questions and Answers – further to the above


What help am I entitled to have with this assignment?
Academic integrity is an important concern. As such, you must write your work yourself,
without collaborating with other students or anyone else, and without using generative AI
(e.g., ChatGPT). This includes doing your own reading of any references.

Are there any other matters that relate to academic integrity?


Yes. You must be honest in reporting the results.

Introduction
There are many data-sets which are collected by whatever means, and there are many
ways to analyse these. Some such data-sets are at or are available via (e.g.)
Kaggle Datasets https://www.kaggle.com/datasets (e.g.,
https://www.kaggle.com/datasets/vikramxd/amazon-business-research-analyst-dataset and
https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023 and
https://www.kaggle.com/datasets/omargowaily/aussie-vehicle-tax-exemptions-check-eligibility),
https://GreatReefCensus.org,
https://www.nyTimes.com/2024/04/17/science/colorado-brain-data-privacy.html,
e-mails, phone calls and reports to the Monash University Safer Community Unit
https://www.monash.edu/students/support/safety-security/safer-community-unit (SCU),
student web site visits and appointments with Monash University's
https://www.monash.edu/student-academic-success Student Academic Success.
Many of the data-sets mentioned above come directly from business, commerce and
industry. Many of them also have a commercial motivation and/or
application - e.g., ecotourism, having the appropriate number of employees to manage
workload, etc. W S Gosset’s (or Student’s) original work on the t distribution was motivated
by his work at the Guinness brewery.

Above is an introduction.

This assignment is worth 40% of the overall mark for FIT1006.

Throughout this Assignment, recall all notes and instructions.

----
----

Qu 1 (8 + 8 + 8 + 4 = 28 marks)

Choose values of p, q and r based on the last digit of your StudentId.


If your StudentId (which appears on your Student card) is hypothetically 3987654, then the
last digit is 4, and this corresponds to the row which begins ``4 or 5’’.

Last digit of StudentId:    p       q       r

0 or 1                      0.01    0.8     0.1
2 or 3                      0.02    0.9     0.1
4 or 5                      0.03    0.8     0.2
6 or 7                      0.04    0.7     0.3
8 or 9                      0.05    0.75    0.25

Throughout this and all questions, clearly state any assumptions, show all working and
explain all your reasoning.

A company is interested in estimating how many of its employees are fit for work. To
simplify the question, we assume that each test is free of cost.

The probability of having a certain medical condition C1 is p throughout the population.

For people with C1, if they do test T1 then the probability of a positive test result is q.
For people without C1, if they do test T1 then the probability of a positive test result is r.

a) If someone tests positive for C1 from test T1 then what is the probability that they have
C1?

b) If someone tests negative for C1 from test T1 then what is the probability that they have
C1?

We now modify matters so that some people who test positive for C1 from test T1 are
permitted to do test T1 again.
As before, for people with C1, if they do test T1 then the probability of a positive test result is
q.
As before, for people without C1, if they do test T1 then the probability of a positive test
result is r.
c) If someone tested positive in the first test T1 for C1 and now tests positive again for C1
from test T1 then what is the probability that they have C1?

d) If someone tested positive in the first test T1 for C1 but now tests negative for C1 from
test T1 then what is the probability that they have C1?
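
Parts (a) to (d) are applications of Bayes' theorem. As a minimal sketch only, the following Python shows the general form of the calculation, using hypothetical values p = 0.10, q = 0.90 and r = 0.20 that are deliberately not a row of the table above. The function name posterior is an illustrative assumption. Treating a repeat of test T1 as conditionally independent of the first result (given whether the person has C1) is also an assumption, which would need to be stated and justified in any answer.

```python
def posterior(prior, p_result_given_c1, p_result_given_not_c1):
    """Bayes' theorem: P(C1 | result) from a prior and the two likelihoods."""
    numerator = prior * p_result_given_c1
    evidence = numerator + (1.0 - prior) * p_result_given_not_c1
    return numerator / evidence


# Hypothetical illustration values (NOT a row of the assignment table).
p, q, r = 0.10, 0.90, 0.20

after_positive = posterior(p, q, r)              # P(C1 | T1 positive)
after_negative = posterior(p, 1 - q, 1 - r)      # P(C1 | T1 negative)

# Assumption: the second T1 result is conditionally independent of the
# first given C1 status, so the first posterior serves as the new prior.
after_two_positives = posterior(after_positive, q, r)
after_positive_then_negative = posterior(after_positive, 1 - q, 1 - r)

print(after_positive, after_negative)
print(after_two_positives, after_positive_then_negative)
```

Using the posterior from the first test as the prior for the second is one way to organise the working; it gives the same result as conditioning on both test results at once under the stated independence assumption.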

----

Qu 2 (6 + 4 = 10 marks)

Throughout this and all questions, clearly state any assumptions, show all working and
explain all your reasoning.

We present below some artificial data about the time taken by the Monash University Safer
Community Unit (SCU, https://www.monash.edu/students/support/safety-security/safer-community-unit)
to deal with complaints and reports. We emphasise that this data is artificial. (We could
equally well have made up artificial data about Monash University's Student Academic Success,
https://www.monash.edu/student-academic-success.) Analysis of such data could assist with,
e.g., staffing levels.

(a) SCU received approx. 40 reports in February and 60 reports in March.


Let us assert that February had 18 online reports, 15 e-mail reports and 7 phone
reports (or 18 online and 22 not online). Let us assert that March had 28 online
reports, 18 e-mail reports and 14 phone reports (or 28 online and 32 not online).

Prior to learning these 100 data points, someone hypothesised that half the reports to the SCU
are online - or, more specifically, that the difference from 0.5 is not statistically significant.

Clearly stating any assumptions and showing all working, what is your opinion of this
hypothesis?
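
One common way to examine a hypothesised proportion is a two-sided one-sample z-test based on the normal approximation to the binomial. The sketch below is illustrative only: the function name proportion_z_test and the counts (60 out of 100) are hypothetical and are not the SCU figures above. It assumes the reports are independent and that n is large enough for the normal approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """Two-sided one-sample z-test for a proportion (normal approximation)."""
    p_hat = successes / n
    std_error = sqrt(p0 * (1.0 - p0) / n)   # standard error under H0: p = p0
    z = (p_hat - p0) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts for illustration only.
z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
print(z, p_value)
```

An exact binomial test is an alternative when the sample is small; whichever test is used, the significance level should be stated before the result is interpreted.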

(b) In this artificial data, 85% of reports to SCU are resolved within 40 working days.
Assume that the rate at which reports are resolved is related to a Poisson distribution (with a
certain rate per day). Putting the first sentence of this question another way, assume that
there is a probability of 0.85 that a report will have been resolved within 40
days. Estimate the Poisson rate at which reports are resolved.
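
Under one possible reading (an assumption, to be stated explicitly in any answer), resolution of a report is the first event of a Poisson process with rate lambda per day, so the probability of resolution within t days is 1 - exp(-lambda * t), and hence lambda = -ln(1 - P) / t. The sketch below applies this relationship to hypothetical figures (probability 0.5 within 10 days) rather than to the figures in the question; the function name rate_from_probability is illustrative.

```python
from math import exp, log


def rate_from_probability(prob_resolved, days):
    """Solve 1 - exp(-rate * days) = prob_resolved for the daily rate."""
    return -log(1.0 - prob_resolved) / days


# Hypothetical figures for illustration only.
rate = rate_from_probability(prob_resolved=0.5, days=10)

# Sanity check: plugging the rate back in recovers the probability.
recovered = 1.0 - exp(-rate * 10)
print(rate, recovered)
```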

P.S. to Question 2: While the SCU and Student Academic Success do indeed both exist (and
with above web links), we again emphasise that the above data is artificial.
----
----

Qu 3 (7 marks)

The Reserve Bank of Australia (RBA) keeps all sorts of data, some of which is on exchange
rates https://www.RBA.gov.au/statistics/historical-data.html#exchange-rates. Monash
University has several campuses in Australia and also several campuses outside of Australia,
including Monash Malaysia. There is active interaction, collaboration and travel between
these campuses. In order to make plans, various people have accessed
https://www.RBA.gov.au/statistics/tables/xls-hist/2023-current.xls and column N, concerning
MYR (the Malaysian ringgit). Someone has looked at the data from 17-Apr-2023 (where the
exchange rate is 2.9674) to (252 rows and 1 year later) 17-Apr-2024 (where the exchange rate
is 3.0714). They have then taken the daily differences. For example, the 1st daily difference
(value at 18-Apr-2023 minus value at 17-Apr-2023) is 2.9884 - 2.9674 = 0.0210. And, for
example, the last daily difference is (value at 17-Apr-2024 minus value at 16-Apr-2024)
3.0714 - 3.0806 = -0.0092.
We only consider days which have matching rows in the spreadsheet for column N and MYR
(Malaysian ringgit) vs Australian dollar (AUD).

Based on this data, someone has hypothesised that, during this period, the daily difference
between the Australian dollar (AUD) and the Malaysian ringgit (MYR) has been zero - or,
more specifically, that the difference from 0 is not statistically significant.

Clearly stating any assumptions and showing all working, what is your opinion of this
hypothesis?
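
A hypothesis that the mean daily difference is zero can be examined with a one-sample t-test. The sketch below is illustrative only: it computes daily differences and the t statistic for a short hypothetical series of rates (not the RBA data), and the function names are assumptions. It assumes the differences are independent and approximately normally distributed; the resulting statistic would then be compared with a t distribution on n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def daily_differences(rates):
    """Each value minus the value on the preceding row."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    n = len(values)
    std_error = stdev(values) / sqrt(n)   # stdev uses the n - 1 divisor
    return (mean(values) - mu0) / std_error, n - 1


# Hypothetical short series for illustration only.
rates = [3.00, 3.02, 3.01, 3.05, 3.04, 3.08]
diffs = daily_differences(rates)
t_stat, df = one_sample_t(diffs)
print(diffs, t_stat, df)
```

Note that a series of n rates yields n - 1 differences, which affects the degrees of freedom.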

----
----

Qu 4 (7 + 7 + 5 + 3 + 3 = 25 marks)

Much money is spent by various companies (at least one of which is high profile) and other
organisations (e.g., NASA, etc.) on space exploration.

The Challenger Space Shuttle Disaster was mentioned in Lecture 4, approximately slides
118-120. Seven astronauts died within approximately 73 seconds of launch. Relevant data
(even if perhaps seemingly different in places from that given in Lecture 4) is given (not at
http://wps.aw.com/wps/media/objects/15/15719/projects/ch5_challenger/index.html but
rather) at https://www.archive.ics.uci.edu/dataset/92/challenger+usa+space+shuttle+o+ring.
Following DOWNLOAD (in the top right-hand corner) at that
https://www.archive.ics.uci.edu/dataset/92/challenger+usa+space+shuttle+o+ring link, we
obtain o-ring-erosion.names and o-ring-erosion-only.data (and we will ignore the other files).
In describing the contents of the file o-ring-erosion-only.data, the file o-ring-erosion.names
mentions
``6. Number of Attributes: 5
1. Number of O-rings at risk on a given flight
2. Number experiencing thermal distress
3. Launch temperature (degrees F)
4. Leak-check pressure (psi)
5. Temporal order of flight’’

Throughout this and all questions, clearly state any assumptions, show all working and
explain all your reasoning.

(a) Split the data from o-ring-erosion-only.data into two roughly equal-sized groups,
based on the launch temperature (degrees F, where we recall that 32 degrees F is the
temperature at which water freezes, equivalent to 0 degrees C).

For these two groups, consider (attribute 2) the number of O-rings experiencing thermal
distress. Clearly stating any assumptions, make a case as to whether or not the two groups
come from the same distribution for attribute 2.

(b) For the two groups from part (a), consider (attribute 4) the leak-check
pressure. Clearly stating any assumptions, make a case as to whether or not the two
groups come from the same distribution for attribute 4.

(c) Split the data from o-ring-erosion-only.data into two roughly equal-sized groups (as
much as is possible), based on (attribute 4) the leak-check pressure.

For these two groups, consider (attribute 2) the number of O-rings experiencing thermal
distress. Clearly stating any assumptions, make a case as to whether or not the two groups
come from the same distribution for attribute 2.

(d) Let us return to part (a). On the ill-fated launch day of 28/January/1986, the
temperature was below freezing (see, e.g., Wikipedia). Based on your analysis in part
(a), how many O-rings would you expect to experience – or have experienced -
thermal distress on 28/January/1986?

(e) Let us return now to part (b). As above, on the ill-fated launch day of
28/January/1986, the temperature was below freezing (see, e.g., Wikipedia). Based
on your analysis in part (b), what would you expect to be – or to have been - the leak-
check pressure on 28/January/1986?
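
For the group comparisons in parts (a) to (c), one possible approach is to split the rows at the median of the splitting attribute and compare the two groups with Welch's two-sample t-test. The sketch below is illustrative only: the rows are hypothetical (temperature, distress count) pairs and not the contents of o-ring-erosion-only.data, and the function names are assumptions. With small counts like these, the normality assumption behind a t-test is doubtful, so a non-parametric test may be more defensible; the choice and its assumptions should be justified in any answer.

```python
from math import sqrt
from statistics import mean, median, variance


def split_at_median(records, key_index):
    """Split rows into (at or below median, above median) on one column."""
    cut = median(row[key_index] for row in records)
    low = [row for row in records if row[key_index] <= cut]
    high = [row for row in records if row[key_index] > cut]
    return low, high


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    term_a = variance(sample_a) / len(sample_a)
    term_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(term_a + term_b)
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (len(sample_a) - 1) + term_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical rows for illustration only: (temperature in F, distress count).
records = [(55, 2), (58, 1), (60, 1), (63, 0), (66, 1),
           (70, 0), (72, 0), (75, 1), (78, 0), (80, 0)]

low, high = split_at_median(records, key_index=0)
t_stat, df = welch_t([row[1] for row in low], [row[1] for row in high])
print(len(low), len(high), t_stat, df)
```

When several rows share the median value, the two groups will not be exactly equal in size, which is one reason the question asks only for roughly equal-sized groups.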

----
----

Qu 5 (5 + 5 + 5 + 5 + 10 = 30 marks)

This question continues from your Assignment 1. You are required to use your data from
Assignment 1. To make it easier for your marker, include the relevant data from (e.g.) your
Assignment 1 Qu 2.

For many world events, we wish to estimate the probability as accurately as possible. This
can often be done by an approach called the wisdom of the crowd, in which we combine the
opinions of various people to try to arrive at a combined weighted average opinion. The
various parts to this question concern the best way to combine models of probabilities (in this
case, coming from two different entrants) to get a more reliable probability.

Throughout this and all questions, clearly state any assumptions, show all working and
explain all your reasoning.

For your Player1 and Player2 from Assignment 1, recalling your Assignment 1 Qu 2-5,

(a) calculate the Q-correlation on their `This Round' scores.

(b) calculate the Pearson correlation coefficient on their `This Round' scores.

For your Player1 and Player2 from Assignment 1, recalling your Assignment 1 Qu 6-8,
consider the allocated team - Melbourne, Geelong or Richmond, and the associated p.
For clarity and the avoidance of doubt, the term p being used here is not the p-value from
classical hypothesis testing. Rather, it is the probability chosen by the competition entrant,
as in Assignment 1 Qu 6-8.

(c) calculate the Q-correlation between Player1 and Player2 on their values of p.

(d) calculate the Pearson correlation coefficient between Player1 and Player2 on their
values of p.
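
For parts (b) and (d), the Pearson correlation coefficient is the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below is illustrative only: the two score lists are hypothetical and are not Assignment 1 data, and the function name pearson_r is an assumption. The Q-correlation of parts (a) and (c) is as defined in the unit materials and is not reproduced here.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores for illustration only.
player1 = [0.2, 0.5, -0.1, 0.4, 0.3]
player2 = [0.1, 0.4, 0.0, 0.5, 0.2]

r = pearson_r(player1, player2)
print(r)
```

The two sequences must be paired (the same round or game in the same position in each list); otherwise the coefficient is meaningless.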

This question continues with part (e) on the next page.


(e) Find a suitable hypothesis test involving an earlier part of this question, and then
suitably analyse and test the hypothesis.

For this hypothesis test in part (e), try to choose a non-trivial hypothesis involving Player1
(and possibly also involving Player2) that seems - at least on the surface - at least partly
plausible.
This could concern (e.g.) some property of Player1 from Assignment 1 Qu 2-3 and/or Qu 6,
or it could concern (e.g.), from Assignment 1 Qu 2-5 and/or Qu 6-8, the difference between
the values of some property of Player1 and the values of that property for Player2.

A possible hypothesis might be that (e.g.)


(i) Player1 has a mean score of `This Round' of 1.0 bits
or (ii) Player1 and Player2 have the same mean score of `This Round'
or (iii) Player1 has a greater mean score than Player2 for `This Round'.
The above possibilities (i), (ii), (iii) would follow from Assignment 1 Qu 2-5.
The possibilities (iv), (v) below would follow from Assignment 1 Qu 6-8.
A possible hypothesis might be that (e.g.)
(iv) Player1 chooses Melbourne with an average probability of 0.55
or (v) there is no statistically significant difference in the distribution of probabilities that
Player1 and Player2 choose for Melbourne.

Above at points (i), (ii), (iii), (iv) and (v) are some examples of the sort of hypothesis that you
might consider for part (e) of this question.

Again, you are asked to find a suitable non-trivial hypothesis involving an earlier part of this
question, and then suitably analyse and test the hypothesis.

Marks will be given here and throughout for clear explanation based on your documentation.

----

Please recall and carefully re-read all notes and instructions at the start of the Assignment and
throughout the Assignment.

END OF FIT1006 ASSIGNMENT 2 (Monash University, 1st semester 2024)
