Prediction Model


PREDICTION MODEL FOR US PRESIDENTIAL ELECTIONS USING R SOFTWARE

Explanation of the model:

• This model is prepared for election forecasting, which is the art and science of predicting
the winner of an election before votes are actually cast, using polling data from likely voters.
• Here, the primary focus is on the US presidential elections, which are held every 4 years.
• There are two main competitive candidates: the Republican and the Democratic candidate.
• While in most countries the winner is decided by the majority of votes, that isn’t the case
in the USA.
• There are 50 states in the United States, and each is assigned a number of electoral votes
based on its population.
• The candidate who receives the most votes in a state gets all of its electoral votes. Across
the entire country, the candidate who receives the most electoral votes wins the presidential
election.
• Polling data from RealClearPolitics.com, collected in the months leading up to the 2004,
2008, and 2012 US presidential elections, was used.
• Each row in the data set represents a state in a particular election year.
• The dependent variable, called Republican, is a binary outcome.
• It is 1 if the Republican won that state in that particular election year, and 0 if the
Democrat won.
• The independent variables are related to polling data in that state.
• For instance, the Rasmussen and SurveyUSA variables come from two major polls that are
conducted across many different states in the United States.
• Each represents the percentage of voters who said they were likely to vote Republican
minus the percentage who said they were likely to vote Democrat. For example, if the
variable SurveyUSA has the value -6, the share of voters who said they were likely to vote
Democrat was 6 percentage points higher than the share who said they were likely to vote
Republican in that state.
• DiffCount counts the number of polls leading up to the election that predicted a
Republican winner in the state, minus the number of polls that predicted a Democratic
winner.
• PropR, or proportion Republican, is the proportion of all those polls leading up to the
election that predicted a Republican winner.
• Since it is not known in advance which model will fit best, we first run regressions on
the 2004 and 2008 data, decide on the best model, and then apply it to the 2012 data.
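As a small worked example of how the poll-based variables described above are defined, the sketch below computes a margin, DiffCount and PropR for one state. The poll percentages are made up for illustration, the names rep_pct, dem_pct, margin, diff_count and prop_r are hypothetical, and treating a tied poll as predicting neither winner is an assumption.

```r
# Made-up results of five polls in one state (percent of likely voters).
rep_pct <- c(44, 47, 51, 43, 46)   # said they would vote Republican
dem_pct <- c(50, 47, 45, 49, 48)   # said they would vote Democrat

# Margin in the sense of the Rasmussen and SurveyUSA variables:
# Republican percentage minus Democratic percentage.
margin <- rep_pct - dem_pct        # -6  0  6 -6 -2

# DiffCount: polls predicting a Republican winner minus polls
# predicting a Democratic winner (a tied poll counts for neither).
diff_count <- sum(margin > 0) - sum(margin < 0)   # 1 - 3 = -2

# PropR: proportion of all polls that predicted a Republican winner.
prop_r <- mean(margin > 0)         # 1 of 5 polls = 0.2

print(c(DiffCount = diff_count, PropR = prop_r))
```

A first margin of -6 corresponds to the SurveyUSA example above: the Democrat leads by 6 percentage points in that poll.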
EXPLANATION AND RESULTS

• The model is trained on data from the 2004 and 2008 elections and tested on data from the
2012 presidential election.
• First, a data frame called Train is created using the subset function, which takes the
original polling data frame and keeps only the observations where the Year was 2004 or 2008.
• Another subset, Test, is created to hold the 2012 polling data.
• Next, we need a baseline model against which to compare the logistic regression model.
• For this, we check the breakdown of the dependent variable in the training set.
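The split and the breakdown described above can be sketched as follows. This is a minimal illustration, assuming a data frame named polling with columns Year and Republican; the rows are made up and are not the RealClearPolitics data, so the counts it prints differ from those reported in this document.

```r
# Hypothetical stand-in for the polling data frame (all values are made up).
polling <- data.frame(
  State      = rep(c("A", "B", "C", "D"), times = 3),
  Year       = rep(c(2004, 2008, 2012), each = 4),
  Rasmussen  = c(5, -7, 0, 12, 3, -10, -2, 9, 4, -8, 1, 11),
  Republican = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1)
)

# Training set: 2004 and 2008 observations. Test set: 2012 observations.
Train <- subset(polling, Year == 2004 | Year == 2008)
Test  <- subset(polling, Year == 2012)

# Breakdown of the dependent variable in the training set.
outcome_counts <- table(Train$Republican)
print(outcome_counts)

# The simple baseline always predicts the more common outcome,
# so its training accuracy is the share of that outcome.
baseline_accuracy <- max(outcome_counts) / nrow(Train)
print(baseline_accuracy)
```

On the toy rows the more common outcome covers 5 of 8 training observations, so the simple baseline would be 62.5% accurate there.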

• Interpretation: In 47 of the 100 training observations the Democrat won the state, and in
53 the Republican won.
• So the baseline model always predicts the more common outcome, i.e., that the Republican
wins the state, and has 53% accuracy on the training set.
• It is a weak model because it predicts Republican even for a state where the Democrat has
a higher chance of winning or is polling 15 to 20 percentage points ahead of the Republican.
• A better baseline model is one that considers only one poll. Here, Rasmussen is used, and
the prediction for each state is the party that the poll showed to be leading in that state.
• For this, the sign function is used on the Rasmussen variable: it returns 1 if the
Republican is leading, -1 if the Democrat is leading, and 0 if the poll is tied and the
baseline is inconclusive.
• Results of the baseline model:

• Republicans are predicted to win in 55 of the training observations and Democrats in 42,
and the baseline was inconclusive in 3.
• Comparison of baseline model 1 with baseline model 2:

• Here, 0 and 1 denote Democratic and Republican wins, respectively.


• The results show that in 42 cases baseline model 2 correctly predicted that the Democrat
would win. In total, model 2 made 3 mistakes and had 2 inconclusive results, whereas
baseline model 1 made 47 mistakes.
• Model 2 is clearly the better baseline in this scenario.
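
The poll-based baseline and its comparison with the actual outcomes can be sketched as below. This is an illustration on made-up rows, not the original data; the names baseline2, comparison, correct, inconclusive and mistakes are hypothetical.

```r
# Hypothetical training rows (values are made up).
Train <- data.frame(
  Rasmussen  = c(5, -7, 0, 12, 3, -10, -2, 9),
  Republican = c(1, 0, 0, 1, 1, 0, 1, 1)
)

# sign() returns 1 for a Republican lead, -1 for a Democratic lead
# and 0 for a tied poll.
baseline2 <- sign(Train$Rasmussen)
print(table(baseline2))

# Actual outcome (rows) against the poll-based prediction (columns).
comparison <- table(Actual = Train$Republican, Predicted = baseline2)
print(comparison)

# Count correct, inconclusive and wrong predictions.
correct <- sum(baseline2 == 1 & Train$Republican == 1) +
  sum(baseline2 == -1 & Train$Republican == 0)
inconclusive <- sum(baseline2 == 0)
mistakes <- nrow(Train) - correct - inconclusive
print(c(correct = correct, inconclusive = inconclusive, mistakes = mistakes))
```

On these toy rows the poll-based baseline is right 6 times, wrong once and inconclusive once.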

LOGISTIC REGRESSION MODEL

• Before running the regression, we should consider the possibility of multicollinearity
among the independent variables.
• To do so, we check the correlation of the independent variables with each other and
also with the response variable.
• The following results were obtained:

There is a high degree of correlation among the variables, so we consider including only one
predictor at a time.
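
A correlation check of this kind can be sketched with cor(). The rows below are made up for illustration, so the printed values are not the ones obtained in this analysis; cor_matrix is a hypothetical name.

```r
# Hypothetical training rows (values are made up).
Train <- data.frame(
  Rasmussen  = c(5, -7, 0, 12, 3, -10, -2, 9),
  SurveyUSA  = c(4, -9, -1, 10, 6, -8, 1, 7),
  DiffCount  = c(3, -5, 0, 8, 2, -6, -1, 5),
  PropR      = c(0.8, 0.1, 0.5, 1.0, 0.7, 0.0, 0.4, 0.9),
  Republican = c(1, 0, 0, 1, 1, 0, 1, 1)
)

# Pairwise correlations among the predictors and the response.
# cor() needs numeric columns, so any text column (such as a state name)
# would have to be left out of the selection.
cor_matrix <- cor(Train[c("Rasmussen", "SurveyUSA", "DiffCount",
                          "PropR", "Republican")])
print(round(cor_matrix, 2))
```

Off-diagonal entries close to 1 or -1 between two predictors are the sign of multicollinearity that motivates using one predictor at a time.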

Starting with PropR:

In model 1, we predict the probability of a Republican win on the basis of the polling
variable PropR.

The results are:

• Interpretation:
• For every 1-unit increase in PropR, the log-odds of a Republican win increase by 11.390,
and the result is statistically significant.
• Next, we check the predictive performance of this model.
• Interpretation: In the columns, 0 shows that the Democrat won and 1 shows that the
Republican won. TRUE shows that we predicted Republican and FALSE shows that we
predicted Democrat.
• The results show that the model made 4 mistakes, so its accuracy is very close to that of
the baseline set earlier.
• So, we need to improve the model further.
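
The one-variable step can be sketched as follows: fit a logistic regression of Republican on PropR, then cross-tabulate the actual outcome against predictions at a 0.5 cutoff. The rows are made up for illustration, the 0.5 threshold is an assumption, and the names mod1, pred1 and confusion1 are hypothetical, so the coefficient and mistake count will differ from those reported above.

```r
# Hypothetical training rows (values are made up; outcomes overlap in PropR
# so that the fit is not perfectly separated).
Train <- data.frame(
  PropR      = c(0.0, 0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.6, 0.8, 0.9, 1.0),
  Republican = c(0,   0,   0,   1,   0,   1,   0,   1,   0,   1,   1,   1)
)

# Logistic regression with a single predictor.
mod1 <- glm(Republican ~ PropR, data = Train, family = binomial)
print(summary(mod1))

# Predicted probability of a Republican win for each training row.
pred1 <- predict(mod1, type = "response")

# Actual outcome (rows) against the prediction at an assumed 0.5 cutoff.
confusion1 <- table(Actual = Train$Republican, Predicted = pred1 >= 0.5)
print(confusion1)
```

The coefficient on PropR is on the log-odds scale, which is how the value 11.390 above is read.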

TWO VARIABLE MODEL

• Going back to the correlation matrix, we look for variables that have comparatively
low correlation with each other.
• For model 2, we use SurveyUSA and DiffCount as independent variables and compute
the predictions.

• Interpretation: With a unit increase in SurveyUSA, the log-odds of a Republican win
increase by 0.2976, holding DiffCount fixed.
• With a unit increase in DiffCount, the log-odds of a Republican win increase by 0.76,
holding SurveyUSA fixed.
• Both results are statistically significant.
• Also, the AIC, which is used to evaluate the model, is lower than for the previous model,
which indicates a better fit.
• Prediction Results:
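
As a minimal sketch of the two-variable step, the code below fits both models on made-up training rows, compares their AIC values, and applies the two-variable model to a made-up test set. None of the numbers are from the original analysis; the names mod1, mod2, aic_values, test_pred and test_table, and the 0.5 cutoff, are assumptions.

```r
# Hypothetical training rows (values are made up).
Train <- data.frame(
  SurveyUSA  = c(-10, -8, -6, -2, -2, 0, 0, 3, 3, 5, 7, 9, 12, -4),
  DiffCount  = c(-6, -5, -3, 2, 2, 0, 0, -1, -1, 2, 4, 5, 7, -2),
  PropR      = c(0.0, 0.1, 0.2, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4,
                 0.7, 0.8, 0.9, 1.0, 0.3),
  Republican = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0)
)

# Hypothetical test rows standing in for the 2012 data.
Test <- data.frame(
  SurveyUSA  = c(-5, 1, 6, -1),
  DiffCount  = c(-3, 0, 4, 1),
  Republican = c(0, 0, 1, 1)
)

# One-variable and two-variable logistic regressions.
mod1 <- glm(Republican ~ PropR, data = Train, family = binomial)
mod2 <- glm(Republican ~ SurveyUSA + DiffCount, data = Train, family = binomial)

# Lower AIC suggests the better trade-off between fit and complexity.
aic_values <- c(PropR = AIC(mod1), SurveyUSA_DiffCount = AIC(mod2))
print(aic_values)

# Predicted probabilities on the test set, cross-tabulated at a 0.5 cutoff.
test_pred  <- predict(mod2, newdata = Test, type = "response")
test_table <- table(Actual = Test$Republican, Predicted = test_pred >= 0.5)
print(test_table)
```

On toy data the AIC ordering need not match the one reported above; the sketch only shows where each number comes from.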
