
ECS 132 Term Project

Alexander Kramer Jatin Mohanty Davis Wong Rahul Udatha

Problem A: Poster Analysis


Jatin Mohanty
913879483, jamohanty

The poster chosen here was found in the Academic Surge building. The research was conducted by Dr. Pete Klimley of the Department of Wildlife, Fish, and Conservation Biology, Ag. and Environ. Sciences. The study compares the identification accuracy of humans versus the DARWIN software at matching dorsal fins to individual sharks, with the end goal of providing a reliable population estimate of sharks.

In this poster, a p-value of p = 0.011 was used to justify preferring human identification over software recognition. A confidence interval might be more useful in this analysis, since it would show the range of correctness of the manual method versus the software method. In addition, a confidence interval reflects the number of samples; since we are not given the number of samples on the poster, we don't know its impact on the p-value. By relying only on p-values, data could be manipulated via p-hacking to get the desired results. In the same vein, adding enough samples to the data (as in the fair-coin-flip example) could make the result statistically significant, which could be misleading.

Alex Kramer
914241094, akramer
Reyes-Chin-Wo, S., Christopolous, M., McHale, L., Kozik, A., and Michelmore, R. "Expansion of gene families encoding disease resistance related proteins in the Lactuca clade within the Compositae"

Figure 1: Located on floor four of the Genome and Biomedical Sciences Facility

The poster presents research using bioinformatics methods on the genomes of lettuce, lettuce's wild progenitor, chicory, and endive. These four genomes were recently sequenced (i.e. their DNA sequence was determined), and this project used the sequence data to predict where in the sequence genes occur, and the corresponding protein that would eventually be translated.

The authors used 0.001 as a threshold to "filter" protein predictions, looking only at those with a p-value less than this. They also predicted the coil structure of these proteins using a similar method, this time choosing a p-value of 0.025. Especially without the context of the full publication, these choices of p-values are fairly arbitrary, and possibly misleading. Their significance test was probably something similar to: first, assume that the current sequence does not code for a protein (the null hypothesis); then, if their chosen test statistic leads to a p-value less than their threshold, reject this null hypothesis and conclude that they have likely found a protein. By setting a stringent threshold like this, their method could lead one to conclude that observations falling on either side of the cutoff are inherently different. A sequence with a p-value of 0.0011, for example, may be just as likely to be protein-coding as a sequence with a p-value of 0.0009, so they are missing good candidates by setting a hard cutoff. A confidence interval is generally a more informative approach. In this case, the researchers could look for, say, a 95% confidence interval for a population parameter that is linked to protein coding. Then they would know that, over many experiments, the interval contains the mean of that parameter with probability 0.95. So if a new data point's parameter lies in the interval, they could state something to the effect of, "the sequence's parameter lies within our confidence interval, so it is likely a protein-coding sequence because its population parameter is within x of the mean with 95% confidence" (i.e. they are 95% confident that the mean is within the range, in which case the new sequence's parameter will be close to the mean).

Davis Wong
912008297, kfwong

The poster chosen here is found on the UC Davis Undergraduate Research website
(https://urc.ucdavis.edu/photo-galleries/uc-davis-academic-posters).

This research was conducted by Laura Fonseca under the Department of Environmental Science and Policy. The study analyzes a wave of the infectious fungus Batrachochytrium dendrobatidis, which caused mass die-offs in amphibians from 1966 to 2007. Because these amphibians are predators of mosquitoes, the analysis attempts to find the correlation between the incidence of malaria and the decline of the healthy amphibian population.

The poster's results table, using a threshold of p < 0.01, shows malaria incidence increasing as the amphibian population declines. However, relying on just the p-value to draw conclusions may be misleading. With only a p-value, we simply reject the null hypothesis that there is no difference between the means, though some variation could still cause a false positive. We need both the p-value and a confidence interval to yield a more conclusive result. The sample size of malaria incidents is also quite large (about 7000 at the peak), so the p-value will likely be significant most of the time. A confidence interval would give us a range for the population hit by malaria along with the healthy amphibian population.

Rahul Udatha
912101851, rudatha
This poster aims to observe habitat selection among golden-mantled ground squirrels (GMGS) under fluctuating density at the Rocky Mountain Biological Laboratory in Gothic, Colorado. The habitat preference of females and males was measured by using habitat availability in a 13-ha field site and comparing the observed values to the expected values. The observations were made using dye markings according to the type of habitat. Female GMGS show signs of a matrilineal social structure and are presumed to locate themselves on a territorial basis, so the team estimated the relationship between density and habitat while isolating gender.

There are a few limitations to using p-values. The basis of hypothesis testing relies on the statement being significant or not. The assumed measure of significance, p < 0.05, should not be mistaken for a reliable indicator that would validate the hypothesis. Two biological observations are seldom identical, and the 5% significance level is not absolute; it relies on some form of convention. Significance testing uses classification questions such as "whether or not the variable is affected" and "are these pairs jointly significant?" to provide results. The population size isn't that large either, and using a confidence interval instead of a p-value would benefit the data because of the limited size. Four of the five p-values in this poster are on the borderline of either accepting the null hypothesis or rejecting it. A confidence interval would provide a range for the average proportion of habitat used, using female density as a base. This change would help set upper and lower limits and create a simpler way of measuring the data. These limits give us an idea of the area of effect.
This poster is unique because of its small sample size. That makes it difficult to use a p-value and, at the same time, is not sufficient to produce a narrow CI. Still, it is better to have a range of values (or multiple ranges, if categorized) than a hypothesis test which, as in this analysis, sits on a coin's edge.

Problem B: Weather prediction
1 Introduction
The goal of this project is to apply linear regression and k-nearest neighbors modeling in order to predict various weather variables in the day1 dataset provided by the UC Irvine Machine Learning Repository. In particular, it aims to predict the general weather type (clear, misty, snowy, ...) as well as the real and "feeling" temperature, humidity, and wind speed.

1.1 Data
Our dataset consists of 731 observations over two years (2011 and 2012). There are seventeen features in the dataset, five of which we are interested in predicting. These variables are:

weathersit: a value of 1, 2, 3, or 4 indicating the category of weather.

temp: temperature measured in degrees Celsius.

atemp: "feeling" temperature.

hum: humidity.

windspeed: wind speed.

2 Feature Selection and Transformation


2.1 Exploratory Analysis
We began by examining our data and what we know about the domain. For example,
temperature and humidity in general are inversely related, because as air temperature in-
creases, the atmosphere is able to hold more water molecules, and as a result, the relative
humidity decreases. When temperatures drop, the relative humidity increases. We graphed
temperature vs. time and humidity vs. time, and this effect is apparent.
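The time-series plots were produced with base graphics (the full plotting code is in Appendix F):

plot(day1$temp, xlab = "Observations", ylab = "Temperature (*C)",
     main = "Temperature across Time Series", type = "l", col = "red")
plot(day1$hum, xlab = "Observations", ylab = "Humidity",
     main = "Humidity across Time Series", type = "l", col = "blue")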

While the rising-falling pattern is less distinctive in the humidity vs. time graph, the high points of humidity correspond to low points in the temperature vs. time graph. This trend led us to keep in mind that temperature and humidity could be good predictors of each other. We also generated scatter plots of each variable against all other variables using pairs() (Appendix F).

This led us to notice a few trends. weathersit did not appear to have much correlation
with the temporal variables. It takes on values 1, 2, and 3 with about equal frequency
in each season, month, and year. temperature varies through the months and seasons
as one would expect, with the warmest days in the summer and the coldest in the winter.

atemp matches the trends of temp very closely, with a few outlier values.

The number of bikers on a given day is positively correlated with the temperature. This
could be explained by the fact that warmer days are more pleasant, so more people will
ride a bike. The trend is stronger for casual riders, which makes sense because registered
riders might be more dedicated to biking regularly despite non-ideal weather conditions.

2.2 Initial Feature Selection
Before we began predicting individual variables, we were able to exclude some features
that we felt had little bearing on predicting weather, based on intuition and knowledge of
the physical processes of weather. We chose to discard the following columns from our
dataset:

instant: This variable is an ID number assigned to the data point. Because it is continually
increasing and independent of any real-world measurements, we feel it will have little
merit in prediction.

yr: The year in which the measurement occurred (1 for 2011, 2 for 2012). Including the
year as a predictor may have value if we are trying to predict weather in the same
time period that the data collection occurred. However, because we already have data
for each day in the two-year period of the study, we aim to create a model that will
extend well to other years. So to prevent over-fitting, we discard the year variable
(if the sample were across many years, factors like climate change could come into
effect, but because we only have two years of data we assume such effects are not
significant in the data).

holiday: A binary variable indicating whether the day is a holiday or not. We do not
think this variable has a significant predictive potential (if the holiday variable was
categorical, indicating if the holiday is Christmas, Hanukkah, Labor Day, ... it may
have a correlation to weather).

weekday: The day of the seven-day week. We find it unlikely for any significant weather patterns to emerge in seven-day cycles. We concluded that the only likely merit this variable could have is the distinction between weekday and weekend (discussed below). But this is contained within workingday, so we discard it.

tot: The total number of bikers on the day of the observation. Because this variable is an
exact linear combination of other variables (casual + registered), we discard it.

After removing these features, we initially thought we should also discard workingday.
But some further reasoning led us to avoid discarding it at this point. We thought that the
number of bikers on a certain day might have some predictive potential for the reasons
discussed above: more bikers on a day might indicate warmer weather. If this is the case,
then this trend may be modified based on whether or not it is a working day (for example
if it is the weekend, then more bikers could mean higher temperature, but this might not
hold true during the weekdays). So we chose to leave in workingday for now due to this
possible effect in combination with casual and registered. The code for this process is in
(Appendix B).

2.3 Feature Transformation (for Regression)


We made some transformations on features (Appendix B) to better model them for linear regression. We did not make these modifications (except for the first listed below) to the features we used for k-nearest neighbors, because the magnitude of a variable in that method doesn't indicate its weight in prediction.

dteday: The date in string format can’t be used as is. The year and month are already
in other columns of day1, but we don’t have a column for day within the month, so
we created a new column monthday, into which we extracted the day of the month
from dteday.

monthday: This variable that we created wraps around after reaching 31 (or 30, 29, or 28). It is a circular variable. We can model circular variables by their sine and cosine [1], which eliminates the sudden jump back to the start and models the variable on two smooth curves. We split monthday into two columns, sin_monthday and cos_monthday, and removed the original column. The new columns take on values sin(π · d/31) and cos(π · d/31) respectively, where d is the day of the month. While some months do not have 31 days, 30/31, 29/31, and 28/31 are close enough to one that the difference is not too great. (A short R sketch of these transformations follows this list.)

mnth: As above, this is a circular variable, so we created two new features, sin_mnth and cos_mnth, equal to sin(π · m/12) and cos(π · m/12), with m being the month.

weathersit: In order to decouple magnitude from importance (e.g. a category value of 2 is twice as "weighted" as 1 when performing regression), we split this variable into three binary variables, ws_one, ws_two, and ws_three. Though there are four categories, we have no data points that fall in category four. That category would add nothing to the regression function (its coefficient would always be zero), so we exclude it. This also means that we will never predict weathersit to be four.

season: As above, we split this categorical variable into winter, spring, summer, and
fall.
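
As a minimal R sketch of these transformations (condensed from Appendix B; d_tmp is our working copy of day1):

d_tmp <- day1

# Circular encoding: map day-of-month d to sin(pi*d/31) and cos(pi*d/31)
d_tmp$monthday     <- strtoi(substr(day1$dteday, 9, 10), base = 10L)
d_tmp$sin_monthday <- sin(pi * d_tmp$monthday / 31)
d_tmp$cos_monthday <- cos(pi * d_tmp$monthday / 31)

# Same idea for the month
d_tmp$sin_mnth <- sin(pi * d_tmp$mnth / 12)
d_tmp$cos_mnth <- cos(pi * d_tmp$mnth / 12)

# Dummy (indicator) variables for the categorical weathersit
d_tmp$ws_one <- ifelse(d_tmp$weathersit == 1, 1, 0)
d_tmp$ws_two <- ifelse(d_tmp$weathersit == 2, 1, 0)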

2.4 Validation
Before we progressed further, we needed a way to measure how well our model was performing. We chose to use cross-validation and measure the mean squared error of our predictions versus the real data. Canonical k-fold validation, in which one randomly selects portions of the data to "hold out" for testing and trains on the rest, would not be effective in our case, because it eliminates the temporal patterns inherent in time-series data. We chose instead to use a "rolling" method of validation [2]:

1. Split the observations into 25 blocks of (approximately) equal size.

2. Train on the first block, then test on the block immediately following the first.

3. Calculate and store the squared error of each prediction.

4. Train on the first two blocks, and test on the following block. Calculate error.

5. Continue in this manner until the training set is as large as our data.

6. Report the mean of all of the squared errors.

Visually, this method can be depicted as a training window that grows by one block at each step, always testing on the block immediately following it.
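
A condensed sketch of the procedure in R (the full versions are cross() and cross_knn() in Appendix A; resp, dat, and n_blocks are stand-ins):

rolling_mse <- function(resp, dat, n_blocks = 25) {
  n <- floor(nrow(dat) / n_blocks)       # block size
  sq_errors <- c()
  k <- n
  while (k < nrow(dat)) {
    bound <- min(k + n, nrow(dat))
    fit  <- lm(resp[1:k] ~ ., data = dat[1:k, ])           # train on all blocks so far
    pred <- predict(fit, newdata = dat[(k + 1):bound, ])   # test on the next block
    sq_errors <- c(sq_errors, (pred - resp[(k + 1):bound])^2)
    k <- k + n
  }
  mean(sq_errors)   # mean squared error across all test blocks
}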

Prediction of scalar variables is straightforward (Appendix A, cross(), cross_knn()), but we needed some modifications in order to predict the categorical weathersit (Appendix A, cross_cat(), cross_cat_knn()). To do this, we predicted ws_one and ws_two (i.e. the probability that each equals 1), and inferred P(ws_three = 1) as 1 − (P(ws_one = 1) + P(ws_two = 1)). Our predicted value of weathersit was the category with the maximum of these three predicted probabilities. The squared error of each prediction is then 1 if we predicted incorrectly and 0 otherwise (squaring doesn't change the value), so the mean squared error is the misclassification rate.
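
For example, a minimal sketch of the class decision (the probabilities here are hypothetical):

p1 <- 0.55                        # predicted P(ws_one = 1)
p2 <- 0.30                        # predicted P(ws_two = 1)
p3 <- 1 - (p1 + p2)               # inferred P(ws_three = 1)
pred <- which.max(c(p1, p2, p3))  # predicted weathersit category: 1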

2.4.1 Why Cross-validate?


We chose to measure the mean squared error over held-out groups of testing data. Another approach would be to measure the error after training the model on the entire set of observations. That approach may lead to overfitting, however, and the end goal is to develop a model that will extend well to new data, rather than predict data it has already "seen."

2.5 Linear Regression Model Selection
At this point, we had a baseline model that we were ready to tweak further. Here is the list of the (18) features in that model: workingday, winter, spring, summer, fall, ws_one, ws_two, ws_three, sin_monthday, cos_monthday, sin_mnth, cos_mnth, casual, registered, temp, atemp, hum, windspeed. This is a lot of features.

2.5.1 Number of Prior Days


The number of total predictors will be k · N, where k is the number of prior days and N is the number of features. A rule of thumb in the textbook states that one should aim to keep the number of predictors below the square root of the number of observations. Assuming we use all 731 − k observations (the first k have NA entries) to train the model, we should have no more than ≈ 27 predictors. With eighteen features, then, we can only use k = 1. Next, we tried to reduce the number of features so that we could use more prior days, as well as to prevent overfitting.
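
A quick check of that arithmetic:

n_obs <- 731 - 1   # usable observations with k = 1 prior day
sqrt(n_obs)        # ~27.02: the rule-of-thumb cap on predictors
18 * 1             # 18 predictors with k = 1 fits under the cap
18 * 2             # 36 predictors with k = 2 would not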

2.5.2 Validation Baseline


We ran our rolling validation routine on this model, including all eighteen features, with k = 1. We also chose to validate a null model, in which our response variable is predicted only by a constant (e.g. lm(y ~ 1)). Our results:

Response Variable    MSE        Null Model MSE
temp                 7.8034     85.2675
atemp                15.5529    133.1684
windspeed            31.2240    27.4515
hum                  0.0209     0.0201
weathersit           0.40656    0.3638

Our model performs reasonably better than the model with no predictors for the temperature variables, but clearly needs improvement for the other three.

2.5.3 Initial Attempts


In trying to reduce the number of features in our model, we first had the idea of generating all possible combinations of features, measuring the error of each model using cross-validation, and choosing the best. We explored using the dredge function from the MuMIn package to enumerate all possible models (the list of models was an impressive 4.2 gigabytes in size). We quickly realized this would be computationally prohibitive.

2.5.4 Correlation
Earlier we removed tot because it was exactly the sum of casual and registered. Now, we searched for variables that were correlated, though not necessarily completely. Using the caret package, we found (Appendix B) that temp and atemp had a high degree of correlation (≥ 0.75). We decided to remove atemp from our models, except for the model predicting atemp, in which case we removed temp. This search for correlated variables also reminded us that we should remove variables with exact correlation, so for weathersit, we only need to include ws_one and ws_two. Similarly, we only need three indicator variables for season.
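
The search itself is a short caret call (as in select_features() in Appendix B; m_data is the lagged feature matrix built there):

library(caret)

# Flag features whose pairwise correlation is at or above 0.75
corr_mat <- cor(m_data)
drop_idx <- findCorrelation(corr_mat, cutoff = 0.75)
colnames(m_data)[drop_idx]   # candidates to remove, e.g. atemp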

After removing these correlated variables, our errors decreased slightly:

Response Variable    MSE
temp                 7.6696
atemp                15.4058
windspeed            30.9554
hum                  0.0189
weathersit           0.40371

We are now predicting humidity better than the null model.

2.5.5 Feature Importance


We began to consider each variable independently to further select features. The caret package also includes a method of determining variable "importance." This method (Appendix C) uses the t-statistic of each variable in order to rank its predictive potential. Using the importance rankings for each response variable with k = 1, along with trial and error, we came up with the following models (with a one-day lookback):

Response      Predictors
temp          temp, windspeed, sin_mnth, cos_mnth, summer
atemp         atemp, windspeed, sin_mnth, cos_mnth, summer
wind          windspeed, atemp, sin_mnth, cos_mnth, ws_one, registered
hum           hum, windspeed, cos_monthday, cos_mnth, ws_one
weathersit    ws_one, ws_two, hum, sin_mnth, summer

With our new models, our errors decreased further:

Response Variable    MSE
temp                 6.7194
atemp                13.6986
windspeed            26.9555
hum                  0.01462
weathersit           0.3566

Now all of our predictions perform better than if we had no predictors.

2.5.6 Higher Order Terms


To further improve our prediction accuracy, we tried adding some higher-order polynomial terms to our models (Appendix D). For temperature and feeling temperature, including higher-order terms had no effect on the error. For windspeed, we added the square of humidity and the square of feeling temperature. Squaring the feeling temperature rather than the true temperature makes sense in this case because the degree of wind makes a difference in how cold it feels on a given day. We also squared humidity for predicting humidity. For weathersit, we found that adding the cube of humidity was effective.
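
These terms were appended directly to the design matrices; condensed from Appendix D, where m_data[, 15] is the lagged humidity and m_data[, 14] the lagged feeling temperature:

# Squared humidity and squared feeling temperature for windspeed
m_data_wind <- cbind(m_data_wind, m_data[, 15]^2, m_data[, 14]^2)
# Squared humidity for the humidity model
m_data_hum <- cbind(m_data_hum, m_data[, 15]^2)
# Cubed humidity for weathersit
m_data_ws <- cbind(m_data_ws, m_data[, 15]^3)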

2.5.7 Number of Days Ago, Revisited


Throughout all of our testing, increasing the number of prior days we considered always led to an increase in mean squared error. We had the best results with k = 1: predicting based only on yesterday's conditions.

2.5.8 Final Models for Regression


The final model for each weather variable is listed below, along with its MSE (Appendix E), using our rolling validation method (each predictor is the value of that feature from the previous day):

Let α = temp, ζ = atemp, γ = windspeed, δ = mnth, η = summer, ω1 = ws_one, ω2 = ws_two, φ = monthday, σ = registered, ρ = hum, each being that variable's value from one day ago. Here sin(δ) and cos(δ) denote the transformed features sin(π · δ/12) and cos(π · δ/12) from Section 2.3, and cos(φ) denotes cos(π · φ/31).

Response: Formula (MSE)

temp: 2.8210 + 0.7107α − 0.1948γ + 5.8730 sin(δ) − 1.0079 cos(δ) + 2.8210η (MSE 6.7193)

atemp: 2.0153 − 0.2221γ + 8.7962 sin(δ) − 1.7136 cos(δ) + 0.6558ζ + 1.3973η (MSE 13.8570)

wind: 11.2025 + 0.2955γ − 3.1967 sin(δ) + 1.9173 cos(δ) − 1.0383ω1 − 0.1654η − 0.0003σ − 0.7627ρ² − 0.0060ζ² (MSE 25.6080)

hum: 0.5731 − 0.0102 cos(φ) − 0.0174 cos(δ) − 0.0115ω1 + 0.0450ρ + 0.2991ρ² − 0.0072γ (MSE 0.0145)

weathersit: 1.6384 − 0.0746η − 0.0904 sin(δ) − 0.4549ω1 − 0.2504ω2 + 0.255ρ + 0.1589ρ³ (MSE 0.3381)
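
Each row above is an ordinary least-squares fit on the one-day-lagged features; as a hypothetical sketch for the temp model (the lagged column names here are illustrative, not the exact ones generated in Appendix B):

# temp predicted from yesterday's temp, windspeed, month encoding,
# and summer indicator (illustrative column names)
fit_temp <- lm(temp ~ temp_ago + windspeed_ago + sin_mnth_ago +
                 cos_mnth_ago + summer_ago, data = m_df)
coef(fit_temp)   # coefficients corresponding to the temp row above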

3 K-nearest Neighbors
Next we applied the k-nearest neighbors method to all of the weather variables. For our features, we discarded any transformed variables like sin_mnth and hum^2. The results of running basicKNN() from the regtools package are summarized below:

Response      Predictors                          MSE      k
temp          temp, windspeed, summer             10.922   5
atemp         atemp, windspeed, summer            20.09    6
wind          windspeed, summer, atemp, ws_one    25.77    18
hum           hum, windspeed, ws_one              0.0154   10
weathersit    ws_one, ws_two, hum, summer         0.3524   15

The k-nearest neighbors method does not create a "model" for predicting new data; it just uses the k closest data points (by Euclidean distance) to make a prediction. In this sense it is somewhat simpler than finding a good linear regression model, because we do not have to perform feature transformation to reduce the error of our model (instead we can change the parameter k). We performed the same rolling validation method described above and measured the mean squared error. We tested different values of k and chose the one with the lowest MSE for each response variable.
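
A minimal sketch of one such prediction (the same regtools call used in Appendix A; train_x, train_y, and test_x are stand-ins):

library(regtools)

# Predict the response for test rows from the k = 5 nearest training rows
fit <- basicKNN(x = train_x,    # training feature matrix
                y = train_y,    # training response
                newx = test_x,  # rows to predict
                kmax = 5)
preds <- fit$regests            # predicted values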

4 Comparison
Our linear regression models performed better than the k-nearest neighbors method for all
weather variables.

Response      MSE (regression)    MSE (KNN)
temp          6.7193              10.922
atemp         13.8570             20.09
wind          25.6080             25.77
hum           0.0145              0.0154
weathersit    0.3381              0.3524

5 Epilogue: Principal Component Analysis


For Problem B, we weren't permitted to use methods outside of KNN and linear regression ("171 stuff"), but after we completed the standard KNN and linear regression approaches described above, we had the idea of applying dimensionality reduction to our data, as an experiment to see if we could improve our results.

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a method that uses an orthogonal transformation to convert observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [3]. This is accomplished by computing eigenvectors and eigenvalues: each eigenvector provides the direction of a "line," and its eigenvalue gives the variance of the points projected onto that line. Since our goal is to capture as much of the variance of the original data as possible with the principal components, we want to maximize the eigenvalues (projected variance) (definition provided by Professor Tagkopoulos). It is important to note that PCA is not feature removal; all features are needed to construct the principal components.

A Simpler Explanation
In simpler terms, we use PCA to reduce the dimensionality of the features into principal components that aim to explain the variance of the original data. This reduction simplifies the data, makes it possible to visualize, and potentially provides a valid input to our models.

Our Plan
R provides the function prcomp, which we used to run PCA on the data we supplied. The following sections present the results, and the process we went through to get them, for one day of lag data. We compared these results to see if PCA could provide better results than our feature selection method.
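
The PCA step itself, condensed from Appendix G:

# Drop the response columns, then run PCA on standardized features
pca_data <- m_data[, -c(5, 6, 7, 8)]
m_data.pr <- prcomp(pca_data, center = TRUE, scale. = TRUE)
summary(m_data.pr)   # "Proportion of Variance" and "Cumulative Proportion" rows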

5.1 Features Used


Earlier we transformed the data with sine and cosine functions applied to the day of the month. For PCA we decided to simplify the process and include only the original relevant variables: season, weathersit, temp, atemp, humidity, windspeed, casual, and registered. As in the previous models, we made the categorical variables into dummy variables.

5.2 One Day Lag
5.2.1 Results
(All figures in this subsection refer to the figures below.)
The picture on the top left of the figures below is a summary of running PCA on our data. The two relevant rows are "Proportion of Variance," with each entry describing how much variance that principal component accounts for in the data, and "Cumulative Proportion," which is the accumulated amount of explained variance.

The plot on the bottom right, titled "Screeplot of the first 10 PCs," is a plot of the computed eigenvalues. The plot titled "Cumulative variance plot" shows the variance explained by the principal components as more are added.

The PC1/PC2 plot represents about 52 percent of the variance of the original data set.

From this information, we decided to include the first 7 principal components, as they explained about 90 percent of the variance of the original data set. We validated our results in the same manner as the previous linear regression and KNN models, so we could compare and see if PCA could best either method.

5.2.2 Comparison
Response      MSE (regression)   MSE (KNN)   MSE (PCA+regression)   MSE (PCA+KNN)
temp          6.7193             10.922      28.72989               26.30407
atemp         13.8570            20.09       48.44823               44.02425
wind          25.6080            25.77       143.09050              25.86798
hum           0.0145             0.0154      0.57932013             0.01687296
weathersit    0.3381             0.3524      0.3780314              0.3708987

As we can see from these results, PCA resulted in a higher test MSE, meaning that the information lost by reducing the data was significant. The MSE for PCA with regression was especially poor, with each error several times greater than either the basic regression or KNN result. The same can be said for the PCA with KNN MSEs, but to a lesser degree. We can conclude that, overall, linear regression is the superior model among the ones we tested.

6 Future Considerations
There were some more methods we wanted to try to improve our models and experiment with. We wanted to explore more feature selection options, using the AIC metric to influence our feature selection decisions. There are also other methods of dimension reduction, such as UMAP and t-SNE, that we could explore. And of course, if we were provided more data, we could build a more accurate model.

7 Contributions
Alex was responsible for linear regression model development, the cross-validation method, feature transformation, and the corresponding code and write-up portions, as well as typesetting the final report.

Davis worked on manipulating the data in R, adding higher order terms to the model,
and other modifications to the linear regression model and the report.

Jatin worked on feature selection (variable importance, correlation), KNN prediction and
cross-validation, dimensionality reduction (PCA), and the write-up of the relevant sections.

Rahul performed exploratory data analysis, generated plots, and described the results in
the report, and contributed to refining the linear regression model.

Appendix
A Functions for Validation
cross <- function(resp, data, n) {
  # Rolling validation for linear regression: train on rows 1..k, test on
  # the next block, then grow the training window by one block.
  # Returns c(model MSE, null-model MSE).
  sq_errors <- c()
  sq_errors_null <- c()
  k <- n
  while (k <= nrow(data)) {
    train <- data[1:k, ]
    if (k + n > nrow(data)) {
      bound <- nrow(data)
    } else {
      bound <- k + n
    }
    test <- cbind(data[(k + 1):bound, ], resp[(k + 1):bound])
    model <- lm(resp[1:k] ~ ., data = train)
    null_model <- lm(resp[1:k] ~ 1, data = train)
    pred <- predict(model, newdata = test)
    sq_error <- (pred - resp[(k + 1):bound])^2
    pred_null <- predict(null_model, newdata = test)
    sq_error_null <- (pred_null - resp[(k + 1):bound])^2
    sq_errors <- c(sq_errors, sq_error)
    sq_errors_null <- c(sq_errors_null, sq_error_null)
    k <- k + n
  }
  mse <- mean(sq_errors)
  mse_null <- mean(sq_errors_null)
  return(c(mse, mse_null))
}

cross_knn <- function(resp, data, n, kmax) {
  # Rolling validation for k-nearest neighbors (regtools::basicKNN).
  sq_errors <- c()
  k <- n
  while (k <= nrow(data)) {
    train <- data[1:k, ]
    if (k + n > nrow(data)) {
      bound <- nrow(data)
    } else {
      bound <- k + n
    }
    test <- data[(k + 1):bound, ]
    real <- resp[(k + 1):bound]

    pred <- regtools::basicKNN(x = train,
                               y = resp[1:k],
                               newx = test,
                               kmax = kmax)$regests

    sq_errors <- c(sq_errors, (pred - real)^2)
    k <- k + n
  }
  mse <- mean(sq_errors)
  return(mse)
}

cross_cat <- function(df, data, n) {
  # Rolling validation for the categorical weathersit: predict the
  # probabilities of ws_one and ws_two, infer ws_three, pick the max.
  sq_errors <- c()
  sq_errors_null <- c()
  k <- n
  while (k <= nrow(data)) {
    train <- data[1:k, ]
    if (k + n > nrow(data)) {
      bound <- nrow(data)
    } else {
      bound <- k + n
    }
    test_1 <- cbind(data[(k + 1):bound, ], df$ws_one[(k + 1):bound])
    test_2 <- cbind(data[(k + 1):bound, ], df$ws_two[(k + 1):bound])
    model_1 <- glm(df$ws_one[1:k] ~ ., data = train)
    model_2 <- glm(df$ws_two[1:k] ~ ., data = train)
    null_model_1 <- lm(df$ws_one[1:k] ~ 1, data = train)
    null_model_2 <- lm(df$ws_two[1:k] ~ 1, data = train)
    pred_1 <- predict(model_1, newdata = test_1)
    pred_2 <- predict(model_2, newdata = test_2)
    pred_null_1 <- predict(null_model_1, newdata = test_1)
    pred_null_2 <- predict(null_model_2, newdata = test_2)
    # flag whichever null prediction is larger
    pred_null_1 <- pred_null_1 == max(pred_null_1, pred_null_2)
    pred_null_2 <- pred_null_2 == max(pred_null_1, pred_null_2)
    pred_3 <- 1 - (pred_1 + pred_2)
    pred_null_3 <- 1 - (pred_null_1 + pred_null_2)
    mx <- pmax(pred_1, pred_2, pred_3)
    mx_null <- pmax(pred_null_1, pred_null_2, pred_null_3)
    pred <- ifelse(pred_1 == mx, 1,
                   ifelse(pred_2 == mx, 2, 3))
    pred_null <- ifelse(pred_null_1 == mx_null, 1,
                        ifelse(pred_null_2 == mx_null, 2, 3))
    # 0/1 loss: a prediction scores 1 iff it misses
    sq_error <- pred != df$ws[(k + 1):bound]
    sq_error_null <- pred_null != df$ws[(k + 1):bound]
    sq_errors <- c(sq_errors, sq_error)
    sq_errors_null <- c(sq_errors_null, sq_error_null)
    k <- k + n
  }
  mse <- mean(sq_errors)
  mse_null <- mean(sq_errors_null)
  return(c(mse, mse_null))
}

cross_cat_knn <- function(df, data, n, kmax) {
  # KNN analogue of cross_cat().
  sq_errors <- c()
  k <- n
  while (k <= nrow(data)) {
    train <- data[1:k, ]
    if (k + n > nrow(data)) {
      bound <- nrow(data)
    } else {
      bound <- k + n
    }
    test_1 <- data[(k + 1):bound, ]
    test_2 <- data[(k + 1):bound, ]

    pred_1 <- regtools::basicKNN(x = train,
                                 y = df$ws_one[1:k],
                                 newx = test_1,
                                 kmax = kmax)$regests
    pred_2 <- regtools::basicKNN(x = train,
                                 y = df$ws_two[1:k],
                                 newx = test_2,
                                 kmax = kmax)$regests
    pred_3 <- 1 - (pred_1 + pred_2)
    mx <- pmax(pred_1, pred_2, pred_3)
    pred <- ifelse(pred_1 == mx, 1,
                   ifelse(pred_2 == mx, 2, 3))

    sq_errors <- c(sq_errors, pred != df$ws[(k + 1):bound])
    k <- k + n
  }
  mse <- mean(sq_errors)
  return(mse)
}

B Initial Feature Selection / Processing


library(caret)  # for findCorrelation()

select_features <- function(data_mat, cut) {
  correlationMatrix <- cor(data_mat)
  # find attributes that are highly correlated (ideally > 0.75)
  highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = cut)
  # print(highlyCorrelated)
}

d_tmp <- day1

# replace dteday with day of the month -> monthday
d_tmp[, "dteday"] <- strtoi(substr(day1[, "dteday"], start = 9, stop = 10),
                            base = 10L)
colnames(d_tmp)[colnames(d_tmp) == "dteday"] <- "monthday"

# split season and weathersit into indicator variables
d_tmp[, "winter"] <- ifelse(d_tmp[, "season"] == 1, 1, 0)
d_tmp[, "spring"] <- ifelse(d_tmp[, "season"] == 2, 1, 0)
d_tmp[, "summer"] <- ifelse(d_tmp[, "season"] == 3, 1, 0)
# d_tmp[, "fall"] <- ifelse(d_tmp[, "season"] == 4, 1, 0)
#   removed due to correlation
d_tmp[, "ws_one"] <- ifelse(d_tmp[, "weathersit"] == 1, 1, 0)
d_tmp[, "ws_two"] <- ifelse(d_tmp[, "weathersit"] == 2, 1, 0)
# d_tmp[, "ws_three"] <- ifelse(d_tmp[, "weathersit"] == 3, 1, 0)
#   removed due to correlation

# replace monthday and month with trig functions
d_tmp[, "sin_monthday"] <- sin((pi * d_tmp[, "monthday"]) / 31)
d_tmp[, "cos_monthday"] <- cos((pi * d_tmp[, "monthday"]) / 31)
d_tmp[, "sin_mnth"] <- sin((pi * d_tmp[, "mnth"]) / 12)
d_tmp[, "cos_mnth"] <- cos((pi * d_tmp[, "mnth"]) / 12)

ws <- d_tmp[, 9]
x <- d_tmp
d_tmp <- cbind(d_tmp[, 8], d_tmp[, 17:19], d_tmp[22:25],
               d_tmp[, 14:15], d_tmp[20:21], d_tmp[, 10:13])
colnames(d_tmp)[1] <- c("workingday")
# remove tot, weathersit, holiday, weekday, year, registered

d <- data.matrix(d_tmp, rownames.force = NA)
daysago <- 1
shifted <- NULL
m <- NULL
cols <- NULL
for (i in 1:daysago) {
  # shift the feature matrix down by i rows to get "i days ago" columns
  day_col <- matrix(NA, nrow = i, ncol = ncol(d))
  cols <- c(cols, day_col)
  day_col <- rbind(day_col, d[1:(731 - i), ])
  colnames(day_col) <- c(paste(colnames(day_col), i, 'ago'))
  m <- cbind(m, day_col)
}

m <- cbind(m, ws, d[, (ncol(d) - 6):ncol(d)])
m <- m[(daysago + 1):nrow(m), ]  # drop the first rows, which contain NAs

n <- floor((731 - daysago) / 25)  # split into 25 chunks
m_df <- as.data.frame(m)
m_data <- m_df[, 1:(ncol(m_df) - 8)]

select_features(m_data, 0.75)

C Variable Importance
importance <- function(resp, data) {
  # Rank predictors by caret's varImp (based on the t-statistic)
  imp <- as.data.frame(varImp(lm(resp ~ ., data = data), scale = FALSE))
  imp <- data.frame(overall = imp$Overall,
                    names = rownames(imp))
  imp <- imp[order(imp$overall, decreasing = TRUE), ]
  return(imp)
}

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
print("temp:")
print(importance(m_df$temp, m_data))
print("atemp:")
print(importance(m_df$atemp, m_data))
print("windspeed:")
print(importance(m_df$windspeed, m_data))
print("hum:")
print(importance(m_df$hum, m_data))
print("weathersit:")
print(importance(m_df$ws, m_data))

D Final Feature Selection
# Column indices (in m_data) of the selected predictors for each response
t  <- c(4, 7, 8, 13, 16)
a  <- c(4, 7, 8, 14, 16)
w  <- c(7, 8, 11, 14, 16, 4, 10)
h  <- c(6, 8, 11, 15, 16)
ws <- c(4, 7, 11, 12, 15)

m_data_temp  <- m_data[t]
m_data_atemp <- m_data[a]
m_data_wind  <- m_data[w]
m_data_wind  <- cbind(m_data_wind, m_data[, 15]^2,
                      m_data[, 14]^2)  # squared hum, atemp
m_data_hum   <- m_data[h]
m_data_hum   <- cbind(m_data_hum, m_data[, 15]^2)

m_data_ws <- m_data[ws]
m_data_ws <- cbind(m_data_ws, m_data[, 15]^3)  # cubed hum

E Reporting Validation Results


print(cross(m_df$temp, m_data_temp, n))
print(cross(m_df$atemp, m_data_atemp, n))
print(cross(m_df$windspeed, m_data_wind, n))
print(cross(m_df$hum, m_data_hum, n))
print(cross_cat(m_df, m_data_ws, n))

print(cross_knn(m_df$temp, m_data_temp[-c(2, 3)], n, 5))
print(cross_knn(m_df$atemp, m_data_atemp[-c(2, 3)], n, 6))
print(cross_knn(m_df$windspeed, m_data_wind[-c(1, 2, 6, 7)], n, 18))
print(cross_knn(m_df$hum, m_data_hum[-c(1, 2, 6)], n, 10))
print(cross_cat_knn(m_df, m_data_ws[-c(2, 6)], n, 15))

F Plots
pairs(day1[, ], pch = 19, lower.panel = NULL)
plot(day1$mnth ~ day1$temp)
plot(day1$temp ~ day1$atemp)
plot(day1$temp ~ day1$casual)
plot(day1$temp ~ day1$registered)
plot(day1$temp, xlab = "Observations",
     ylab = "Temperature (*C)", main = "Temperature across Time Series",
     type = "l", col = "red")
plot(day1$hum, ylab = "Humidity",
     main = "Humidity across Time Series", type = "l",
     col = "blue", xlab = "Observations")

# Annotations for the PCA plots (m_data.pr is created in Appendix G)
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"),
       col = c("red"), lty = 5, cex = 0.6)
cumpro <- cumsum(m_data.pr$sdev^2 / sum(m_data.pr$sdev^2))
plot(cumpro[0:15], xlab = "PC #",
     ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 7, col = "blue", lty = 5)
abline(h = 0.9375, col = "blue", lty = 5)
plot(m_data.pr$x[, 1], m_data.pr$x[, 2],
     xlab = "PC1 (32.6%)", ylab = "PC2 (20.06%)",
     main = "PC1 / PC2 - plot")

G PCA
Reference [3] provided useful examples for this section.
pca_data <- m_data[, -c(5, 6, 7, 8)]
m_data.pr <- prcomp(pca_data, center = TRUE, scale. = TRUE)
screeplot(m_data.pr, type = "l", npcs = 15,
          main = "Screeplot of the first 10 PCs")
pca_t <- c(1, 2, 3, 4, 5, 6, 7)  # keep the first 7 principal components
pca_temp <- m_data.pr$x[, c(pca_t)]
pca_temp_in <- as.data.frame(pca_temp)

# Linear regression with PCA
print('--- Linear Regression MSE (ours vs. null) with PCA ---')
print("temp")
print(cross(m_df$temp, pca_temp_in, n))
print("atemp")
print(cross(m_df$atemp, pca_temp_in, n))
print("wind")
print(cross(m_df$windspeed, pca_temp_in, n))
print("hum")
print(cross(m_df$hum, pca_temp_in, n))
print("weathersit")
print(cross_cat(m_df, pca_temp_in, n))

# KNN with PCA output
print('--- KNN MSE with PCA ---')
print("temp")
print(cross_knn(m_df$temp, pca_temp_in, n, 5))
print("atemp")
print(cross_knn(m_df$atemp, pca_temp_in, n, 6))
print("wind")
print(cross_knn(m_df$windspeed, pca_temp_in, n, 18))
print("hum")
print(cross_knn(m_df$hum, pca_temp_in, n, 10))
print("weathersit")
print(cross_cat_knn(m_df, pca_temp_in, n, 15))

References
[1] Kadhem Al-Daffaie and Shahjahan Khan. Logistic regression for circular data.

[2] Rob J. Hyndman. Cross-validation for time series, Dec 2016.

[3] Peter Nistrup. Principal Component Analysis (PCA) 101, using R, Jun 2019.

