Time Series: Scholarship Statistics
Time Series
In a time series, the values can be indexed: a base period is assigned a round value (e.g. 100 or
1000) and the values before or after that point in time are shown as changes relative to it (e.g. CPI).
- Relate your statistical analysis to the practical context of the data with appropriate units.
- Evaluate the reliability of the data source.
- Consider the variability within the data: is it constant or changing over time?
1. Trend component - the tendency to increase, decrease, or stay steady. These are the smoothed
values of the time series. Consider any unusual features of the trend: are there outliers, can it
be split into a piecewise model, and is the trend likely to continue?
2. Cyclical component - longer-term variations in time-series data that repeat over time, such
as a wave-shaped curve, showing alternating periods of expansion and contraction. Cyclical
components are difficult to analyse (e.g. waves may have differing periods) and at this level
are described along with the trend.
NOTE: Seasonality is always of a fixed and known period. Hence, seasonal time series are
sometimes called periodic time series. A cyclic pattern exists when data exhibit rises and falls
that are not of fixed period.
3. Seasonal component (Seasonality) - repeating variations due to periodic influences over the
same time period.
- Individual seasonal effect: difference between a raw value and the smoothed value for that
season.
- Average seasonal effect: the average of the individual seasonal effects for the same season
across all cycles (e.g. all the Q1 effects).
- Seasonally adjusted value/de-seasonalised data value: raw value with the average seasonal
effect removed (raw value minus average seasonal effect), to see how much it differs from what
is expected.
4. Irregularities – erratic or residual variations (noise) that are not accounted for by trend,
cycle or seasonal components of the time series. A residual is the difference between the
individual seasonal effect and the average seasonal effect for a season.
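The definitions above can be traced through a small additive example. This is only a sketch: the quarterly values in `raw` and the smoothed values in `trend` are made up for illustration, and the variable names are not from any particular package.

```python
# Illustrative (made-up) quarterly data: two years, 4 seasons per year.
raw = [12.0, 18.0, 25.0, 15.0, 15.0, 19.0, 27.0, 18.0]
# Assumed smoothed (trend) values for the same quarters.
trend = [17.0, 17.5, 18.0, 18.5, 19.0, 19.5, 20.0, 20.5]
period = 4

# Individual seasonal effect: raw value minus smoothed value.
individual = [r - t for r, t in zip(raw, trend)]

# Average seasonal effect: mean of the individual effects for each season.
average = []
for season in range(period):
    effects = individual[season::period]
    average.append(sum(effects) / len(effects))

# Seasonally adjusted value: raw value minus the average seasonal effect.
adjusted = [r - average[i % period] for i, r in enumerate(raw)]

# Residual: individual seasonal effect minus average seasonal effect.
residual = [e - average[i % period] for i, e in enumerate(individual)]

print(average)
print(adjusted)
print(residual)
```

In this additive view each raw value is the sum of its trend value, the average seasonal effect for its season, and its residual.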
Moving Averages
This is how the trend line is created: the data are smoothed to suppress the seasonality. Averages
of successive data values are calculated and plotted.
The Seasonal-Trend decomposition using Loess (STL) model is commonly used, where the data points
further away from the point being plotted are given less weight.
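As a rough illustration of smoothing, here is a simple centred moving mean. Note this is the plain unweighted version, not the weighted Loess smoothing that STL uses; the function name `centred_moving_average` and the data are made up for this sketch.

```python
def centred_moving_average(values, period):
    """Return centred moving means; positions without a full window are None."""
    n = len(values)
    half = period // 2
    smoothed = [None] * n
    for i in range(half, n - half):
        if period % 2 == 1:
            # Odd period: one window centred on position i.
            window = values[i - half:i + half + 1]
            smoothed[i] = sum(window) / period
        else:
            # Even period: average two adjacent windows so the result
            # lines up with an actual time point.
            first = sum(values[i - half:i + half]) / period
            second = sum(values[i - half + 1:i + half + 1]) / period
            smoothed[i] = (first + second) / 2
    return smoothed

# Illustrative (made-up) quarterly data, so the period is 4.
raw = [12.0, 18.0, 25.0, 15.0, 15.0, 19.0, 27.0, 18.0]
trend = centred_moving_average(raw, 4)
print(trend)
```

Because each window covers exactly one full cycle of seasons, the seasonal highs and lows largely cancel out, leaving the trend.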
Variation as Percentages
The range of each of the trend, seasonal and residual components should be expressed as a
percentage of the range of the raw data. This will help determine which component has the most
impact on the data. (component range / raw data range × 100%)
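The percentage calculation can be sketched as follows. The component values are made up for illustration, and `percent_of_raw_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

def percent_of_raw_range(component, raw):
    """Range of a component as a percentage of the range of the raw data."""
    return value_range(component) / value_range(raw) * 100

# Illustrative (made-up) quarterly data and its assumed components.
raw = [12.0, 18.0, 25.0, 15.0, 15.0, 19.0, 27.0, 18.0]
trend = [17.0, 17.5, 18.0, 18.5, 19.0, 19.5, 20.0, 20.5]
seasonal = [-4.5, 0.0, 7.0, -3.0]
residual = [-0.5, 0.5, 0.0, -0.5, 0.5, -0.5, 0.0, 0.5]

for name, component in [("trend", trend), ("seasonal", seasonal), ("residual", residual)]:
    print(name, round(percent_of_raw_range(component, raw), 1), "%")
```

The percentages do not need to add to 100%; they are only a way of comparing the relative size of each component.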
Forecasts
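The Holt-Winters Additive model is used to create predictions/forecasts. This model assumes that
the size of the seasonal variation is reasonably consistent. If the seasonality is inconsistent (e.g.
the seasonal swings grow or shrink as the level of the series changes), a multiplicative model may
be better. Holt-Winters has no cyclical component, so forecasts are also less reliable when there is
a cyclical pattern. (Just mention this)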
The Holt-Winters process applies more weight to the most recent data points. It smooths each of the
three components (mean is smoothed to give a local average value, trend is smoothed, each
seasonal sub series is smoothed e.g. all January values, all February values, to give a seasonal
estimate for each season) and extrapolates beyond the range of the data. A 95% prediction interval
is produced. Forecasts close to the data set will be more reliable than further in the future.
Reliability of forecasts will depend on variability within the data and the degree to which the
features (seasonality etc.) fit the model. To investigate reliability, remove the last few data points,
‘predict’ them from forecasts and check if those predictions are close to the raw data you removed.
There will always be an element of chance that can affect predictions.
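The smoothing and the "remove the last few points" check can be sketched in plain Python. This is a simplified illustration under several assumptions: the smoothing weights `alpha`, `beta` and `gamma` are fixed by hand (statistical software normally estimates them), the starting values come from the first two seasons, the series is made up, and no prediction interval is calculated.

```python
def holt_winters_additive(values, period, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead; needs at least two full seasons of data."""
    first = values[:period]
    second = values[period:2 * period]
    # Starting trend: average change per time step between the first two seasons.
    trend = (sum(second) - sum(first)) / period ** 2
    # Starting level: estimated one step before the first observation.
    level = sum(first) / period - trend * (period + 1) / 2
    seasonal = [v - (level + (i + 1) * trend) for i, v in enumerate(first)]

    for t, value in enumerate(values):
        s = seasonal[t % period]
        previous_level = level
        # Each update blends the newest observation with the old estimate,
        # so recent data carry more weight than older data.
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (value - level) + (1 - gamma) * s

    n = len(values)
    return [level + (h + 1) * trend + seasonal[(n + h) % period]
            for h in range(horizon)]

def hold_out_errors(values, period, n_held_out, alpha=0.3, beta=0.1, gamma=0.2):
    """Remove the last few points, forecast them, and return forecast minus actual."""
    training = values[:-n_held_out]
    actual = values[-n_held_out:]
    predicted = holt_winters_additive(training, period, alpha, beta, gamma, n_held_out)
    return [p - a for p, a in zip(predicted, actual)]

# Made-up quarterly series: steady upward trend plus a fixed seasonal pattern.
pattern = [-4.5, 0.0, 7.0, -2.5]
series = [10 + t + pattern[t % 4] for t in range(16)]
print(hold_out_errors(series, period=4, n_held_out=4))
```

For this artificially regular series the errors are essentially zero; with real data, larger errors suggest the forecasts should be treated with more caution.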
Indexing
Often done with monetary values to remove effects of inflation. The base period is given a round
number index, such as 100 or 1000. Changes from the base period are then expressed as percentage
change from that point in time using an index.
E.g. 2012, Q1 is set as the base period at 1000. If the index for 2018 Q3 is 1035, it means that the
value being measured has risen by 3.5% (35/1000) since the base period. If the new index is 925, it
means the value being measured has fallen by 7.5% (75/1000) since the base period.
A value for a particular time period can be deflated by dividing the value by the index for that time
period, then multiplying by the index of the base period. This allows values to be compared in real
terms. E.g. March 91, food price value = $1705, with CPI 884 (11.6% below base period). In real
terms: 1705 / 884 x 1000 = $1928.73
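The index arithmetic above can be checked with two small helper functions. The names `percent_change_from_base` and `deflate` are made up for this sketch, and the base index of 1000 is taken from the examples.

```python
def percent_change_from_base(index, base_index=1000):
    """Percentage change of an index value relative to the base period."""
    return (index - base_index) / base_index * 100

def deflate(value, index, base_index=1000):
    """Express a value in base-period (real) terms."""
    return value / index * base_index

print(round(percent_change_from_base(1035), 1))  # rise since the base period
print(round(percent_change_from_base(925), 1))   # fall since the base period
print(round(deflate(1705, 884), 2))              # food price in real terms
```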
Bivariate Data
Two variables, making a comparison between these two and looking for a possible relationship
between them.
- Trend: linear/non-linear?
- Association: positive or negative?
- Relationship strong, moderate, weak, none?
- Scatter: Consistent? Where?
- Outliers: any unusual values, do they affect the trend line?
- Groups: any groupings or clusters? Where?
Line of best fit: the least squares regression line is calculated to summarise the linear
relationship/trend between the two variables. This is the line with the smallest sum of the squares of
the residuals (a residual is the difference between a raw data value and the trend line value).
- The gradient tells us the average change in the response (dependent) variable for each 1 unit
increase in the explanatory (independent) variable.
- The y-intercept shows what to expect when the explanatory variable is 0.
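A minimal sketch of how the gradient and y-intercept of a least squares line can be calculated by hand. The data are made up for illustration, and `least_squares_line` is a hypothetical helper name.

```python
def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

# Illustrative (made-up) explanatory and response values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
gradient, intercept = least_squares_line(xs, ys)
print(round(gradient, 2), round(intercept, 2))
```

Here the gradient would be read as: for each 1 unit increase in x, y increases by about 1.99 units on average.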
Making Predictions
Predictions of the response (dependent) variable for various values of the explanatory
variable can be made by substituting into the equation of the trend line. How reliable a
prediction is depends on the amount of scatter about the trend line. For a
certain x value, there may be a wide range of y values in the raw data, increasing the
uncertainty of the prediction.
Furthermore, consider whether the prediction is interpolation (a prediction made within the range
of explanatory values of the data set) or extrapolation (a prediction outside the range of values of
the explanatory variable of the data set). When extrapolating, care should be taken to consider if it
is reasonable to expect that the trend in the data set can be continued outside the data set.
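A small sketch of making a prediction and flagging whether it is interpolation or extrapolation. The gradient, intercept and x values are assumed for illustration, and `predict` is a hypothetical helper name.

```python
def predict(x, gradient, intercept, xs):
    """Substitute x into the trend line and label the type of prediction."""
    if min(xs) <= x <= max(xs):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return gradient * x + intercept, kind

# Assumed trend line y = 1.99x + 0.05, fitted to x values from 1 to 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
gradient, intercept = 1.99, 0.05

print(predict(3.5, gradient, intercept, xs))  # inside the data range
print(predict(8.0, gradient, intercept, xs))  # outside the data range
```

The label does not change the calculation; it is a reminder to justify why the trend should continue before trusting an extrapolated value.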
The r-value (Pearson’s product-moment correlation coefficient) is between -1 and 1.
It shows how strong the correlation is overall (the strength and direction of the linear relationship
between two variables).
0 = no correlation
1 or -1 = perfect correlation
If the relationship is non-linear, r is not an appropriate measure and the coefficient of
determination R^2 of the fitted model is used instead. It is the proportion of the variation in the
response variable that can be explained by the regression model.
Residuals – a residual plot shows how far each data point's response value is from the
regression line.
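The r-value, R^2 and the residuals can all be calculated from the same sums. This is a sketch using made-up data and a straight-line fit; for a least squares straight line, R^2 works out to be the square of r.

```python
from math import sqrt

# Illustrative (made-up) explanatory and response values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Pearson's correlation coefficient.
r = sxy / sqrt(sxx * syy)

# Least squares line, then residual = raw value minus trend line value.
gradient = sxy / sxx
intercept = mean_y - gradient * mean_x
residuals = [y - (gradient * x + intercept) for x, y in zip(xs, ys)]

# R^2: proportion of the variation in y explained by the model.
r_squared = 1 - sum(e ** 2 for e in residuals) / syy

print(round(r, 3), round(r_squared, 3))
print([round(e, 2) for e in residuals])
```

A residual plot of these values against x should show no clear pattern if a straight line is a suitable model.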
If two variables are correlated, then changes in the values of one variable are associated with
changes in the values of the other variable.
If two variables have a causal relationship, then changes in the values of one variable cause changes
in the values of the other variable.
Correlation does not imply causality – there may be another lurking variable which is affecting the
values of both variables.
A random sample of sufficient size from a well-defined population will probably allow for
generalisations from the relationship observed in the sample to a wider context.
However, if a sample of bivariate data has special characteristics, the findings will probably only
apply to a population with those characteristics, and there may be limited opportunities to
generalise findings to a wider population – thus limiting the usefulness of the investigation.