Assignment 1
Instructions:
1. This document uses the HousePrices.jmp dataset, which is available on the course website alongside
this homework.
2. Each week’s assignment requires you to perform some data analysis using JMP and turn in a
brief report of no more than three pages. A course in statistics is incomplete without applying
the ideas learned to datasets. JMP is relatively friendly software to play around with, and
communicating your observations and results is an essential part of the course learning.
3. Please remember to include your name, section, and PGID at the top of the assignment. Also
ensure that the submission filename includes your PGID followed by name.
4. The homework submissions will not be graded for correctness, but rather for effort. The
scoring will be binary – you will get full credit if your effort is deemed satisfactory, otherwise
you will get no credit. No credit will be given for late submissions. A timely submission that
responds to all questions and shows your thinking will be considered satisfactory whether or
not your solution is correct.
5. Honor code category 2N-b applies. Please ensure that the submitted write-up is entirely your
own work. Significant overlaps with other submissions may be considered
possible violations of the honor code.
6. All assignments are individual assignments and each assignment is worth 5 points.
Have fun!!
Description of the dataset:
A real estate agent is trying to understand the nature of housing stock and home prices in
and around a medium-sized town in upstate New York. She has collected data from a
random sample of 1047 homes sold in the last 12 months. Data was collected on the
following variables and is available in the attached HousePrices.jmp file.
• Price – the sale price of the house in $
• Living Area – in sq. ft.
• Bathrooms – number of bathrooms in the house (powder rooms with no tub or
shower area are considered 0.5 baths)
• Bedrooms – the number of bedrooms
• Lot Size – size of the property on which the house sits (in acres).
• Age – of the house in years
• Fireplace – whether or not the house has a fireplace (Yes = 1, No = 0)
Your task is to analyze this dataset in order to gain some understanding of this particular
real estate market – the values of homes, their characteristics in terms of size and other
features, and relationships between these. This understanding will prove immensely
helpful to the real estate agent in advising her clients. Since all of the homes are from the
same geographical area, location (which usually has a huge bearing on home values) is not
a major concern here.
Most of the analysis will be done in response to the specific questions posed on the
homework assignments. But feel free to explore and play around with the data set to
enhance your own understanding of how to make sense of data.
Assignment 1
1. Prepare a brief report summarizing the home values (prices) in this area. Use both
graphical and numerical summaries. Your report should briefly describe what those
summaries tell you, and anything of particular note/interest.
2. Does the normal model provide a good description of the prices? Use a Normal Quantile
plot to frame your response.
3. Irrespective of your response to Q2, assume that Price ~ N(164K, (68K)²). Given this:
A. Calculate the following probabilities – P(Price > 92.8K), P(Price < 255.5K). Do
these numbers agree with what you see in the data?
B. Once again, assuming the above normal distribution, what percentage of houses
should have a value less than 232K? Does that agree with the data?
C. Based on the theoretical model, what price would you expect for a
house that is exactly at the 3rd quartile (75th percentile)? How does that
compare to the actual value?
4. Create a histogram and boxplot for the Living Area variable. Is the distribution
symmetric? Check the skewness measure to see if it is consistent with your
observation.
5. Create a new column in the dataset by taking the logarithm of the Living Area variable.
Is the normal distribution a better fit for this variable or the original (Living Area)
variable? Why do you think this is the case?
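If you want to cross-check your JMP output for Questions 3 to 5 outside JMP, the calculations can be sketched in Python using only the standard library. This is a minimal sketch, not part of the required JMP workflow. The helper names (prob_above, prob_below, model_quantile, sample_skewness) are made up for illustration, and the made_up_areas list is invented to show the effect of a log transform; it is not taken from HousePrices.jmp. The skewness here is the plain moment-based version, so JMP's reported value may differ slightly if it applies a small-sample adjustment.

```python
import math
from statistics import NormalDist, fmean

# Model stated in Question 3: Price ~ N(164K, (68K)^2), in thousands of dollars.
price_model = NormalDist(mu=164.0, sigma=68.0)


def prob_above(x):
    """P(Price > x) under the assumed normal model (x in $K)."""
    return 1.0 - price_model.cdf(x)


def prob_below(x):
    """P(Price < x) under the assumed normal model (x in $K)."""
    return price_model.cdf(x)


def model_quantile(p):
    """Price (in $K) at cumulative probability p under the model."""
    return price_model.inv_cdf(p)


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    values = list(values)
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Hypothetical, right-skewed living areas in sq. ft. (NOT from HousePrices.jmp).
made_up_areas = [900, 1100, 1300, 1500, 1800, 2400, 4200]
made_up_log_areas = [math.log(a) for a in made_up_areas]

if __name__ == "__main__":
    print("P(Price > 92.8K)  =", round(prob_above(92.8), 4))
    print("P(Price < 255.5K) =", round(prob_below(255.5), 4))
    print("P(Price < 232K)   =", round(prob_below(232.0), 4))
    print("Model 75th percentile ($K) =", round(model_quantile(0.75), 2))
    print("Skewness, raw areas =", round(sample_skewness(made_up_areas), 3))
    print("Skewness, log areas =", round(sample_skewness(made_up_log_areas), 3))
```

To compare with the data, replace the made-up list with the Living Area column exported from JMP, and compare the model probabilities with the corresponding proportions of houses in the sample.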