Prediction Model
This model was prepared for election forecasting, which is the art and science of predicting
the winner of an election before any votes are actually cast, using polling data from likely
voters.
Here, the primary focus will be on US presidential elections, which are held every 4 years.
There are two main competitive candidates: the Republican candidate and the Democratic
candidate.
While in most countries the winner is the candidate who receives the most votes nationwide, in
the USA that isn’t the case.
There are 50 states in the United States, and each is assigned a number of electoral votes
based on its population.
The candidate who receives the most votes in a state gets all of that state's electoral votes.
Then, across the entire country, the candidate who receives the most electoral votes wins the
presidential election.
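The winner-take-all rule described above can be sketched in a few lines. This is only an illustration: the three states, their electoral votes, and the vote counts are made up, and the function name electoral_totals is hypothetical.

```python
from collections import defaultdict

# Hypothetical states, electoral votes, and popular-vote counts (illustrative only).
electoral_votes = {"A": 10, "B": 6, "C": 3}
popular_votes = {
    "A": {"Republican": 510, "Democrat": 490},
    "B": {"Republican": 300, "Democrat": 700},
    "C": {"Republican": 200, "Democrat": 800},
}


def electoral_totals(electoral_votes, popular_votes):
    """Give each state's electoral votes to the candidate with the most votes there."""
    totals = defaultdict(int)
    for state, counts in popular_votes.items():
        state_winner = max(counts, key=counts.get)
        totals[state_winner] += electoral_votes[state]
    return dict(totals)


totals = electoral_totals(electoral_votes, popular_votes)
national_winner = max(totals, key=totals.get)

# Nationwide popular vote, for comparison with the electoral outcome.
popular_total = defaultdict(int)
for counts in popular_votes.values():
    for party, votes in counts.items():
        popular_total[party] += votes

print(totals, national_winner, dict(popular_total))
```

In this made-up example the Republican wins the electoral vote while receiving fewer votes nationwide, which is why the forecast has to be made state by state.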
The data come from RealClearPolitics.com and consist of polling data collected in the months
leading up to the 2004, 2008, and 2012 US presidential elections.
Each row in the data set represents a state in a particular election year.
The dependent variable, which is called Republican, is a binary outcome.
It is 1 if the Republican won that state in that particular election year, and 0 if the
Democrat won.
The independent variables are all derived from polling data in that state.
For instance, the Rasmussen and SurveyUSA variables correspond to two major polls that are
conducted across many different states in the United States.
Each represents the percentage of voters who said they were likely to vote Republican minus
the percentage who said they were likely to vote Democrat. For example, if the variable
SurveyUSA in our data set has the value -6, it means that 6 percentage points more voters said
they were likely to vote Democrat than said they were likely to vote Republican in that state.
DiffCount counts the number of all the polls leading up to the election that predicted a
Republican winner in the state, minus the number of polls that predicted a Democratic
winner.
PropR, or proportion Republican, is the proportion of all the polls leading up to the
election that predicted a Republican winner.
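As a small worked example, the variables defined above can be computed for a single state from a list of polls. The poll figures below are invented, and the handling of tied polls (counted for neither party but kept in the PropR denominator) is an assumption, since the source does not say how ties are treated.

```python
# Hypothetical polls for one state: (percent Republican, percent Democrat).
polls = [(44, 50), (48, 46), (45, 49), (47, 47)]

# Margin in the style of Rasmussen / SurveyUSA: Republican share minus Democrat share.
margins = [rep - dem for rep, dem in polls]

# Number of polls predicting each party as the winner (a tied poll predicts neither).
rep_polls = sum(1 for m in margins if m > 0)
dem_polls = sum(1 for m in margins if m < 0)

# DiffCount: polls predicting a Republican winner minus polls predicting a Democrat.
diff_count = rep_polls - dem_polls

# PropR: proportion of all polls that predicted a Republican winner.
prop_r = rep_polls / len(polls)

print(margins, diff_count, prop_r)
```

A first margin of -6 here corresponds to the SurveyUSA example in the text: 6 percentage points more voters leaning Democrat than Republican.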
Because it is not known in advance which model will fit best, we will first run regressions on
the 2004 and 2008 data, decide on the best model, and then apply it to the 2012 data.
EXPLANATION AND RESULTS
Here, I’ll train the model on data from the 2004 and 2008 elections, and test it on data from
the 2012 presidential election.
First, a data frame called Train is created using the subset function, which filters the
original polling data frame and keeps only the observations where Year is 2004 or 2008.
A second subset, Test, is created to hold the 2012 polling data.
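The split by year can be sketched as follows. This is a plain-Python analogue of the subset step rather than the original code; the rows are invented and only the column names Year and Republican come from the text.

```python
# Hypothetical rows standing in for the polling data frame (one row per state and year).
polling = [
    {"State": "A", "Year": 2004, "Republican": 1},
    {"State": "A", "Year": 2008, "Republican": 0},
    {"State": "A", "Year": 2012, "Republican": 0},
    {"State": "B", "Year": 2004, "Republican": 1},
    {"State": "B", "Year": 2012, "Republican": 1},
]

# Train: observations from 2004 or 2008. Test: observations from 2012.
train = [row for row in polling if row["Year"] in (2004, 2008)]
test = [row for row in polling if row["Year"] == 2012]

print(len(train), len(test))
```

Splitting by election year rather than at random mirrors how the model would be used in practice: fitted on past elections and applied to a future one.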
Next, we need to establish a baseline model against which to compare the logistic regression
model.
For this, we check the breakdown of the dependent variable in the training set.
Interpretation: In 47 of the 100 training observations the Democrat won the state, and in 53
the Republican won.
The baseline model therefore always predicts the more common outcome, i.e., that the
Republican wins the state, and is 53% accurate on the training set.
It is a weak model because it predicts Republican even for a state where the Democrat has
the higher chance of winning or is polling 15 to 20 percentage points ahead of the Republican.
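The 53% figure follows directly from the class counts reported above. A minimal sketch, rebuilding the outcome vector from those two counts only (the individual rows are not available here):

```python
from collections import Counter

# Outcome vector rebuilt from the reported counts: 47 Democrat wins (0), 53 Republican wins (1).
outcomes = [0] * 47 + [1] * 53

counts = Counter(outcomes)
majority_class, majority_count = counts.most_common(1)[0]

# The naive baseline always predicts the majority class.
baseline_accuracy = majority_count / len(outcomes)

print(majority_class, baseline_accuracy)
```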
So, a better baseline model would be one that relies on a single poll. In this case,
Rasmussen is used, and the prediction for each state is the party that the poll shows in the
lead there.
For this the sign function is used: it returns 1 if the Republican is ahead, -1 if the
Democrat is ahead, and 0 if the poll is tied and the model is therefore inconclusive.
Results of the baseline model:
The Republican is predicted to win in 55 observations and the Democrat in 42, and the model
is inconclusive for 3 observations.
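The sign-based baseline can be illustrated with a few made-up Rasmussen margins. Python has no built-in sign function, so a small helper is defined here; the margins are hypothetical and do not reproduce the 55/42/3 breakdown above.

```python
from collections import Counter


def sign(x):
    """Return 1 for a positive value, -1 for a negative value, and 0 for zero."""
    return (x > 0) - (x < 0)


# Hypothetical Rasmussen margins (Republican percent minus Democrat percent).
rasmussen = [7, -6, 0, 12, -3]

# 1 = predict Republican, -1 = predict Democrat, 0 = inconclusive (tied poll).
predictions = [sign(margin) for margin in rasmussen]
breakdown = Counter(predictions)

print(predictions, dict(breakdown))
```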
Comparison of Baseline model 1 with baseline model 2
The predictor variables are highly correlated with one another, so we consider including only
one predictor at a time.
In model 1, we predict the probability of a Republican win on the basis of PropR alone.
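To make the correlation check concrete, here is a minimal sketch of the Pearson correlation between two poll variables. The values are invented for illustration and the helper name pearson is hypothetical; the source's actual correlation matrix is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical margins from two polls across the same five states.
rasmussen = [7, -6, 1, 12, -3]
survey_usa = [9, -8, 0, 15, -2]

r = pearson(rasmussen, survey_usa)
print(round(r, 3))
```

Two polls that measure the same underlying race tend to move together, so a value close to 1 is expected, and adding the second poll contributes little new information.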
Interpretation:
For every one-unit increase in PropR, the log-odds of the Republican winning rather than the
Democrat increase by 11.390, and the coefficient is statistically significant.
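Because PropR is a proportion between 0 and 1, a one-unit change spans its whole range, so it can help to translate the coefficient into odds and probabilities. In the sketch below the coefficient 11.390 is taken from the text, while the intercept of -6.0 is purely an assumption for illustration, since the fitted intercept is not given here.

```python
import math

COEF_PROPR = 11.390  # coefficient reported in the text
INTERCEPT = -6.0     # hypothetical value, not taken from the source


def win_probability(prop_r):
    """Logistic model: convert the log-odds for a given PropR into a probability."""
    log_odds = INTERCEPT + COEF_PROPR * prop_r
    return 1.0 / (1.0 + math.exp(-log_odds))


# Multiplicative change in the odds for a 0.1 increase in PropR.
odds_ratio_per_tenth = math.exp(COEF_PROPR * 0.1)

print(round(odds_ratio_per_tenth, 2))
for prop_r in (0.0, 0.5, 1.0):
    print(prop_r, round(win_probability(prop_r), 3))
```

The odds ratio does not depend on the intercept: each additional 0.1 in PropR multiplies the odds of a Republican win by roughly 3.1. The probabilities themselves do depend on the assumed intercept.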
Next, we need to check the predictive performance of this model.
Interpretation: The columns show the actual outcome: 0 means that the Democrat won and 1
means that the Republican won. The rows show the prediction: ‘True’ means that we predicted
Republican and ‘False’ means that we predicted Democrat.
The results show that the model made 4 incorrect predictions, an accuracy very close to that
of the improved baseline.
So, we need to improve the model further.
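The confusion table and its accuracy can be sketched as follows. The predicted probabilities and outcomes are invented, and the 0.5 threshold is an assumption; only the figure of 4 errors in 100 observations comes from the text.

```python
from collections import Counter


def confusion_matrix(probabilities, actual, threshold=0.5):
    """Count (predicted Republican?, actual outcome) pairs at the given threshold."""
    cells = Counter()
    for p, y in zip(probabilities, actual):
        cells[(p >= threshold, y)] += 1
    return cells


def accuracy(cells):
    """Share of observations on the diagonal: (True, 1) and (False, 0)."""
    correct = cells[(True, 1)] + cells[(False, 0)]
    return correct / sum(cells.values())


# Hypothetical predicted probabilities of a Republican win and actual outcomes.
probabilities = [0.95, 0.10, 0.60, 0.40, 0.85]
actual = [1, 0, 0, 1, 1]

cells = confusion_matrix(probabilities, actual)
toy_accuracy = accuracy(cells)

# Accuracy implied by the text: 4 mistakes out of 100 observations.
reported_accuracy = (100 - 4) / 100

print(dict(cells), toy_accuracy, reported_accuracy)
```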
TWO VARIABLE MODEL
Going back to the correlation matrix, we look for variables that have comparatively low
correlation with each other.
For model 2, we will use SurveyUSA and DiffCount as independent variables and compute the
predictions.