CRITICAL REASONING TEST BATTERY
TECHNICAL MANUAL
CONTENTS

THEORETICAL OVERVIEW
REFERENCES
APPENDICES
  ADMINISTRATION INSTRUCTIONS
  SCORING INSTRUCTIONS
  CORRECTION FOR GUESSING
  NORM TABLES
LIST OF TABLES

1 MEAN AND SD OF AGE, AND GENDER BREAKDOWN, OF THE NORMATIVE SAMPLE FOR THE VCR2
2 MEAN SCORES FOR MEN AND WOMEN (MBAS) ON THE VCR2 AND NCR2
3 ALPHA COEFFICIENTS FOR THE VERBAL AND NUMERICAL CRITICAL REASONING TESTS
4 CORRELATIONS BETWEEN THE VCR2 AND NCR2
5 CORRELATIONS BETWEEN THE VERBAL AND NUMERICAL CRITICAL REASONING TESTS WITH THE APIL-B
6 CORRELATIONS BETWEEN THE ORIGINAL VERSIONS OF THE VCR2 AND NCR2 WITH THE AH5
7 ASSOCIATION BETWEEN THE VCR2, NCR2 AND INSURANCE SALES SUCCESS
8 CORRELATIONS BETWEEN THE VCR2, NCR2 AND MBA PERFORMANCE
1 THEORETICAL OVERVIEW

1 THE ROLE OF PSYCHOMETRIC TESTS IN PERSONNEL SELECTION AND ASSESSMENT
2 THE DEVELOPMENT OF THE CRITICAL REASONING TESTS
  REVISIONS FOR THE SECOND EDITION

THE DEVELOPMENT OF THE CRITICAL REASONING TESTS
Research has clearly demonstrated that in order to accurately assess reasoning ability it is necessary to develop tests which have been specifically designed to measure that ability in the population under consideration. That is to say, we need to be sure that the test has been developed for use on the particular group being tested, and thus is appropriate for that particular group. There are two ways in which this is important. Firstly, it is important that the test has been developed in the country in which it is intended to be used. This ensures that the items in the test are drawn from a common, shared cultural experience, giving each candidate an equal opportunity to understand the logic which underlies each item. Secondly, it is important that the test is designed for the particular ability range on which it is to be used. A test designed for those of average ability will not accurately distinguish between people of high ability, as all the scores will cluster towards the top end of the scale. Similarly, a test designed for people of high ability will be of little use if given to people of average ability. Not only will it fail to discriminate between applicants, as all the scores will cluster towards the bottom of the scale, but, as the questions will be too difficult for most of the applicants, they are likely to be de-motivated, producing artificially low scores. Consequently, the VCR2 and NCR2 have been developed on data from undergraduates: that is to say, people of above average intelligence, who are likely to find themselves in senior management positions as their careers develop.

In constructing the items in the VCR2 and NCR2 a number of guidelines were borne in mind. Firstly, and perhaps most importantly, special care was taken when writing the items to ensure that, in order to correctly solve each item, it was necessary to draw logical conclusions and inferences from the stem passage/table. This was done to ensure that the test was assessing critical (logical/deductive) reasoning rather than simple verbal/numerical checking ability. That is to say, the items assess a person's ability to think in a rational, critical way and make logical inferences from verbal and numerical information, rather than simply check for factual errors and inconsistencies.

In order to achieve this goal for the Verbal Critical Reasoning (VCR2) test, two further points were borne in mind when constructing the stem passages. Firstly, the passages were kept fairly short and cumbersome grammatical constructions were avoided, so that a person's score on the test would not be unduly affected by reading speed, thus providing a purer measure of critical reasoning ability. Secondly, care was taken to make sure that the passages did not contain any information which was counter-intuitive and thus likely to create confusion.

To increase the acceptability of the test to applicants, the themes of the stem passages were chosen to be relevant to a wide range of business situations. As a consequence of these constraints the final stem passages were similar in many ways to the short articles found in the financial pages of a daily newspaper, or in trade magazines.
REVISIONS FOR THE SECOND EDITION
The second edition of the Verbal and Numerical Critical Reasoning tests has been revised to meet the following aims:

● To improve the face validity of the test items, thus increasing the test's acceptability to respondents.
● To modernise the items to reflect contemporary business and financial issues.
● To improve the tests' reliability and validity while maintaining the tests' brevity – with the CRTB being administrable in under one hour.
● To simplify test scoring.
● To make available a hand scored as well as a computer scored version of the tests.
● To remove the impact of guessing on raw VCR2 scores, thus increasing the power of the VCR2 to discriminate between respondents.

As noted above, the most significant change in the second edition of the VCR2 has been the incorporation of a correction for guessing. This obviates the problem that, due to the three-point response scale that is used in most verbal critical reasoning tests, it is possible for respondents to get 33% of the items correct simply by guessing.

While a variety of methods have been proposed for solving this problem (including the use of negative or harsh scoring criteria) we believe that a correction for guessing is the most elegant and practical solution.

This correction is based on the number of items the respondent gets wrong on the test. We know that to get these items wrong the respondent must have incorrectly guessed the answer to that item. We can further assume that, by chance, the respondent incorrectly guessed the answer 66% of the time and correctly guessed the answer 33% of the time. Thus it is possible to estimate the number of correct guesses the respondent made from the number of incorrect responses. This correction can then be subtracted from the total score to adjust for the number of items the respondent is likely to have correctly guessed.

The use of this correction improves the test's score distribution, increasing its power to discriminate between the respondents' 'true' ability levels. Thus it is recommended that test users correct scores for guessing before standardising them. However, as the norm tables for corrected and uncorrected scores are significantly different from each other, it is important, if hand scoring the Critical Reasoning tests, to ensure that the correct norm table is used to standardise the scores on the VCR2: that is to say, either the norm table for uncorrected scores (Appendix IV, Table 2) or the norm table for corrected scores (Appendix IV, Table 3), depending upon whether or not the correction for guessing has been applied.
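To make the arithmetic of the correction concrete, here is a minimal sketch in Python (the function names are ours, not part of the published scoring materials):

```python
def correction_for_guessing(num_wrong: int) -> float:
    """Estimated number of items answered correctly by luck.

    On a three-option item a pure guess is wrong about 66% of the time and
    right about 33% of the time, so for every two wrong answers we expect
    roughly one lucky correct answer: correction = num_wrong * 0.5, which is
    the series of values tabulated in Appendix III.
    """
    return num_wrong * 0.5


def corrected_raw_score(raw_score: int, num_wrong: int) -> float:
    """Subtract the estimated lucky guesses; never report below zero."""
    return max(0.0, raw_score - correction_for_guessing(num_wrong))


# Example: 24 items right and 12 items wrong gives a correction of 6,
# so the corrected raw score is 18.
print(corrected_raw_score(raw_score=24, num_wrong=12))
```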
3 THE PSYCHOMETRIC PROPERTIES OF THE CRITICAL REASONING TESTS

This chapter presents information describing the psychometric properties of the Verbal and Numerical Critical Reasoning tests. The aim will be to show that these measures meet the necessary technical requirements with regard to standardisation, reliability and validity, to ensure the psychometric soundness of these test materials.

1 INTRODUCTION
2 STANDARDISATION
3 BIAS
4 RELIABILITY OF THE CRITICAL REASONING TESTS
5 VALIDITY
6 STRUCTURE OF THE CRITICAL REASONING TESTS
7 CONSTRUCT VALIDITY OF THE CRITICAL REASONING TESTS
8 CRITERION VALIDITY OF THE CRITICAL REASONING TESTS
INTRODUCTION
STANDARDISATION
Normative data allows us to compare an individual's score on a standardised scale against the typical score obtained from a clearly identifiable, homogeneous group of people.

RELIABILITY
The property of a measurement which assesses the extent to which variation in measurement is due to true differences between people on the trait being measured, or to measurement error.

In order to provide meaningful interpretations, the reasoning tests were standardised against a number of relevant groups. The constituent samples are fully described in the next section. Standardisation ensures that the measurements obtained from a test can be meaningfully interpreted in the context of a relevant distribution of scores. Another important technical requirement for a psychometrically sound test is that the measurements obtained from that test should be reliable. Reliability is generally assessed using two specific measures, one related to the stability of scale scores over time, the other concerned with the internal consistency, or homogeneity, of the constituent items that form a scale score.

RELIABILITY – ASSESSING STABILITY
Also known as test-retest reliability, an assessment is made of the similarity of scores on a particular scale over two or more test occasions. The occasions may be from a few hours, days, months or years apart. Normally Pearson correlation coefficients are used to quantify the similarity between the scale scores over the two or more occasions. Stability coefficients provide an important indicator of a test's likely usefulness of measurement. If these coefficients are low (< approx. 0.6) then this suggests either that the abilities/behaviours/attitudes being measured are volatile or situationally specific, or that, over the duration of the retest interval, situational events have made the content of the scale irrelevant or obsolete. Of course, the duration of the retest interval provides some clue as to which effect may be causing the unreliability of measurement. However, the second measure of a scale's reliability also provides valuable information as to why a scale may have a low stability coefficient. The most common measure of internal consistency is Cronbach's Alpha. If the items on a scale have high inter-correlations with each other, and with the total scale score, then coefficient alpha will be high. Thus a high coefficient alpha indicates that the items on the scale are measuring very much the same thing, while a low alpha would be suggestive of either scale items measuring different attributes or the presence of error.
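As an illustration of how these two reliability estimates are usually computed, the following is a minimal NumPy sketch (the function names and data layout are our own assumptions, not taken from the manual):

```python
import numpy as np


def stability_coefficient(time1: np.ndarray, time2: np.ndarray) -> float:
    """Test-retest reliability: the Pearson correlation between the scores
    obtained by the same respondents on two testing occasions."""
    return float(np.corrcoef(time1, time2)[0, 1])


def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency. `items` is a respondents x items array of item
    scores (e.g. 0/1 for incorrect/correct responses)."""
    k = items.shape[1]                                # number of items
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return float((k / (k - 1)) * (1 - item_variances.sum() / total_variance))
```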
RELIABILITY – ASSESSING INTERNAL CONSISTENCY
Also known as scale homogeneity, an assessment is made of the ability of the items in a scale to measure the same construct or trait. That is, a parameter can be computed that indexes how well the items in a scale contribute to the overall measurement denoted by the scale score. A scale is said to be internally consistent if all the constituent item responses are shown to be positively associated with their scale score. The fact that a test has high internal consistency and stability coefficients only guarantees that it is measuring something consistently. It provides no guarantee that the test is actually measuring what it purports to measure, nor that the test will prove useful in a particular situation. Questions concerning what a test actually measures and its relevance in a particular situation are dealt with by looking at the test's validity. Reliability is generally investigated before validity, as the reliability of a test places an upper limit on that test's validity. It can be mathematically demonstrated that a validity coefficient for a particular test cannot exceed that test's reliability coefficient.

VALIDITY
The ability of a scale score to reflect what that scale is intended to measure. Kline's (1993) definition is: 'A test is said to be valid if it measures what it claims to measure'. Validation studies of a test investigate the soundness and relevance of a proposed interpretation of that test. Two key areas of validation are known as criterion validity and construct validity.

VALIDITY – ASSESSING CONSTRUCT VALIDITY
Construct validity assesses whether the characteristic which a test is actually measuring is psychologically meaningful and consistent with the test's definition. The construct validity of a test is assessed by demonstrating that the scores from the test are consistent with those from other major tests which measure similar constructs, and are dissimilar to scores on tests which measure different constructs.

VALIDITY – ASSESSING CRITERION VALIDITY
Criterion validity involves translating a score on a particular test into a prediction concerning what could be expected if another variable was observed. The criterion validity of a test is provided by demonstrating that scores on the test relate in some meaningful way with an external criterion. Criterion validity comes in two forms – predictive and concurrent. Predictive validity assesses whether a test is capable of predicting an agreed criterion which will be available at some future time – e.g. can a test predict the likelihood of someone successfully completing a training course. Concurrent validity assesses whether the scores on a test can be used to predict a criterion measure which is available at the time of the test – e.g. can a test predict current job performance.
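As a concrete illustration, a predictive or concurrent validity coefficient is simply the correlation between test scores and the criterion measure. The sketch below uses invented data and variable names for illustration only:

```python
import numpy as np

# Hypothetical data: standardised test scores for eight candidates and a
# criterion gathered later (e.g. end-of-training performance ratings).
test_scores = np.array([4, 6, 5, 8, 7, 3, 9, 6])
criterion = np.array([52, 61, 58, 75, 70, 45, 80, 63])

# The validity coefficient is the Pearson correlation between the two.
validity_coefficient = float(np.corrcoef(test_scores, criterion)[0, 1])
print(round(validity_coefficient, 2))
```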
STANDARDISATION
The critical reasoning tests were standardised on a mixed sample of 365 people drawn from graduate, managerial and professional groups. The age and sex breakdowns of the normative sample for the VCR2 and NCR2 are presented in Tables 1 and 2 respectively. As would be expected from an undergraduate sample, the age distribution is skewed to the younger end of the age range of the general population. The sex distribution is, however, broadly consistent with that found in the general population.

Norm tables for the VCR2 and NCR2 are presented in Appendix IV. For the Verbal Critical Reasoning test, different norm tables are presented for test scores that have, or have not, been corrected for guessing. (A correction for guessing has not been made available for the Numerical Critical Reasoning test, as the six-point scale this test uses mitigates against the problem of guessing.) As noted above, it is recommended that scores on the VCR2 are corrected for guessing. The correction for guessing should be applied to the raw score (i.e. to the score before it has been standardised). The corrected (or uncorrected) raw score is then standardised with reference to the appropriate norm table (Appendix IV, Table 2 for uncorrected scores and Table 3 for corrected scores). Thus it is important that particular care is taken to refer to the correct norm table when standardising VCR2 raw scores.

In addition, for users of the GeneSys system, normative data is also available from within the software, which computes, for any given raw score, the appropriate standardised score for the selected reference group. The GeneSys™ software also allows users to establish their own in-house norms, to allow more focused comparison with the profiles of specific groups.
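For hand scoring, standardisation amounts to looking the (corrected or uncorrected) raw score up in the appropriate norm table. The sketch below is an unofficial illustration using the sten bands printed in Appendix IV, Table 2 (uncorrected VCR2 scores); the constant and function names are ours:

```python
# Sten bands for uncorrected VCR2 raw scores (Appendix IV, Table 2):
# (lowest raw score in band, highest raw score in band, sten value)
VCR2_UNCORRECTED_NORMS = [
    (0, 7, 1), (8, 10, 2), (11, 12, 3), (13, 16, 4), (17, 20, 5),
    (21, 23, 6), (24, 27, 7), (28, 29, 8), (30, 32, 9), (33, 40, 10),
]


def raw_to_sten(raw_score, norm_table=VCR2_UNCORRECTED_NORMS):
    """Return the sten value whose band contains the raw score."""
    for low, high, sten in norm_table:
        if low <= raw_score <= high:
            return sten
    raise ValueError(f"Raw score {raw_score} is outside the norm table range")


print(raw_to_sten(25))  # an uncorrected raw score of 25 falls in the sten 7 band
```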
BIAS
GENDER AND AGE DIFFERENCES
Gender differences on the CRTB were examined by comparing samples of male and female respondents matched for educational and socio-economic status. Table 2 below provides mean scores for men and women on the verbal and numerical critical reasoning tests, along with the F-ratio for the difference between these means. While the men in this sample obtained marginally higher scores on both the verbal and numerical reasoning tests, the difference was not statistically significant.
Table 2 – Mean scores for men and women (MBAs) on the VCR2 and NCR2
(mean age, and mean and SD of test scores, for men (n=218) and women (n=166), with the F-ratio and significance of the difference between the means)
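Test users wishing to run the same kind of bias check on their own data could do so along the following lines; the scores below are invented for illustration and the manual's actual sample values are not reproduced:

```python
import numpy as np
from scipy import stats

# Hypothetical VCR2 raw scores for matched samples of men and women.
men = np.array([22, 25, 19, 27, 24, 21, 26, 23])
women = np.array([21, 24, 20, 26, 22, 23, 25, 20])

# A one-way ANOVA with two groups yields the F-ratio for the difference
# between the group means.
f_ratio, p_value = stats.f_oneway(men, women)
print(f"F = {f_ratio:.2f}, p = {p_value:.3f}")  # a non-significant p suggests no gender difference
```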
RELIABILITY OF THE CRITICAL REASONING TESTS
If a reasoning test is to be used for selection and assessment purposes, the test needs to measure each of the aptitude or ability dimensions it is attempting to measure reliably, for the given population (e.g. graduate entrants, senior managers, etc.). That is to say, the test needs to be measuring each ability consistently, so that if the test were to be used repeatedly on the same candidate it would produce similar results. It is generally recognised that reasoning tests are more reliable than personality tests, and for this reason high standards of reliability are usually expected from such tests. While many personality tests are considered to have acceptable levels of reliability if they have reliability coefficients in excess of .7, reasoning tests should have reliability coefficients in excess of .8.

INTERNAL CONSISTENCY
Table 3 presents alpha coefficients for the Verbal and Numerical Critical Reasoning tests. Each of these reliability coefficients is substantially greater than .8, clearly demonstrating that the VCR2 and NCR2 are highly reliable across a range of samples.
VALIDITY
Whereas reliability assesses the degree of measurement error of a reasoning test, that is to say the extent to which the test is consistently measuring one underlying ability or aptitude, validity addresses the question of whether or not the scale is measuring the characteristic it was developed to measure. This is clearly of key importance when using a reasoning test for assessment and selection purposes. In order for the test to be a useful aid to selection we need to know that the results are reliable and that the test is measuring the aptitude it is supposed to be measuring. Thus, after we have examined a test's reliability, we need to address the issue of validity. We traditionally examine the reliability of a test before we explore its validity, as reliability sets an upper bound on a scale's validity. That is to say, a test cannot be more valid than it is reliable.
STRUCTURE OF THE CRITICAL REASONING TESTS
Specifically, we are concerned that the tests are correlated with each other in a meaningful way. For example, we would expect the Verbal and Numerical Critical Reasoning tests to be moderately correlated with each other, as they are measuring different facets of critical reasoning ability – namely verbal and numerical ability. Thus, if the VCR2 and NCR2 were not correlated with each other, we might wonder whether each is a good measure of critical reasoning ability. Moreover, we would expect the Verbal and Numerical Critical Reasoning tests not to be so highly correlated with each other as to suggest that they are measuring the same construct (i.e. we would expect the VCR2 and NCR2 to show discriminant validity). Consequently, the first way in which we might assess the validity of a reasoning test is by exploring the relationship between the tests.
Samples: Insurance Sales Agents (n=132), MBAs (n=170), Undergraduates (n=70)

Correlations with the Verbal/Numerical subscale of the AH5: VCR2 .60, NCR2 .51 (Table 6)

Table 7 – Association between the VCR2, NCR2 and insurance sales success
(mean scores of unsuccessful (n=29) and successful (n=23) agents, with t-value and p)
Table 8 – Correlations between the VCR2, NCR2 and MBA performance

                                     VCR2                  NCR2
Innovation & design                  .374 (n=89, p<.01)    .260 (n=89, p<.01)
Business decision making             .467 (n=35, p<.01)    .433 (n=35, p<.01)
Macro economics                      .478 (n=89, p<.001)   .386 (n=89, p<.001)
IT                                   .468 (n=35, p<.01)    .511 (n=35, p<.01)
Post Graduate Diploma in Business Administration:
Average to date                      .364 (n=34, p<.05)    .510 (n=34, p<.01)
Economics                            .236 (n=56, n.s.)     .013 (n=56, n.s.)
Analytical Tools and Techniques      .312 (n=51, p<.05)    .134 (n=51, n.s.)
Marketing                            .204 (n=53, n.s.)     -.124 (n=53, n.s.)
Finance & Accounting                 .209 (n=56, n.s.)     -.007 (n=56, n.s.)
Organisational Behaviour             .296 (n=56, p<.05)    -.032 (n=56, n.s.)
MBA Category                         .389 (n=48, p<.01)    .109 (n=48, n.s.)
4 REFERENCES

Heim, A.H. (1970). Intelligence and Personality. Harmondsworth, Middlesex: Penguin.

Heim, A.H., Watt, K.P. and Simmonds, V. (1974). AH2/AH3 Group Tests of General Reasoning: Manual. Windsor: NFER-Nelson.

Jackson, D.N. (1987). User's Manual for the Multidimensional Aptitude Battery. London, Ontario: Research Psychologists Press.

Johnson, C., Blinkhorn, S., Wood, R. and Hall, J. (1989). Modern Occupational Skills Tests: User's Guide. Windsor: NFER-Nelson.

Terman, L.M. et al. (1917). The Stanford Revision of the Binet-Simon Scale for Measuring Intelligence. Baltimore: Warwick and York.

Watson, G. and Glaser, E.M. (1980). Manual for the Watson-Glaser Critical Thinking Appraisal. New York: Harcourt Brace Jovanovich.

Yerkes, R.M. (1921). Psychological examining in the United States army. Memoirs of the National Academy of Sciences, 15.
5 APPENDIX I
ADMINISTRATION
INSTRUCTIONS
Good practice in test administration requires the assessor to set the scene
before the formal administration of the tests. This scene-setting generally
includes: welcome and introductions; the nature, purpose and use of the
assessment; and feedback arrangements.
If only one (either the Verbal or Numerical) of the Critical Reasoning tests is
being administered, then say:
‘From now on, please do not talk among yourselves, but ask
me if anything is not clear. If you have a mobile phone
please ensure that it is switched off. We shall be doing only
one of the two tests contained in the booklet that I will
shortly be distributing.’
Say either:
‘During the test I shall be checking to make sure you are not
making any accidental mistakes when filling in the answer
sheet. I will not be checking your responses to see if you are
answering correctly or not.’
If you are administering both the Verbal and Numerical Critical Reasoning
tests (as is more common), and if this is the first or only questionnaire being
administered, give an introduction as per, or similar to, the example script
provided.
Continue by using the instructions exactly as given. Say:
‘Print your last name and first name on the line provided,
and indicate your title and sex followed by your age and
today’s date.’
Explain to the respondents what to enter in the boxes marked ‘Test Centre’
and ‘Comments’. Walk round the room to check that the instructions are
being followed.
Then continue:
If you reach the ‘End of Test’ before time is called you may
review your answers if you wish.
If you have any questions please ask now, as you will not be
able to ask questions once the test has started.’
Then say very clearly:
‘Stop’
You should intervene if candidates continue after this point.
If you are only administering the Verbal Critical Reasoning test say:
‘We are now ready to start the next test. Has everyone still
got two sharpened pencils, an eraser and some unused rough
paper?’
If not, rectify, then say:
Point to the section on the answer sheet marked Example Questions (as you
read the above).
Then say:
If you reach the ‘End of Test’ before time is called you may
review your answers if you wish.
If you have any questions please ask now, as you will not be
able to ask questions once the test has started.’
Then say very clearly:
If you are only administering the Verbal Critical Reasoning test say:
APPENDIX II
SCORING INSTRUCTIONS

To score and standardise the VCR2 follow steps 2-8. To score and standardise the NCR2 follow steps 9-10.
2 Count up the number of correct responses for the VCR2 and enter the
total in the box marked ‘Total’ (Raw Score).
If you do not wish to correct the VCR2 score for guessing go straight to step 7.
3 To correct the VCR2 score for guessing add up the total number of
incorrect responses (i.e. the total number of items attempted minus the
raw score) and enter this in the box marked ‘Number Wrong’.
4 The correction for guessing can be found in Appendix III. The number
of incorrect responses is listed in the first column of this table and the
corresponding correction for guessing is listed in the second column.
Make note of the correction for guessing (that corresponds to the
number of incorrectly completed items).
5 To obtain the corrected raw score, subtract the correction for guessing
from the raw score. If this number is negative (i.e. the correction for
guessing is larger than the raw score) then the corrected raw score is
zero. Enter the corrected raw score in the box marked
‘Corrected/Uncorrected Raw Score’. To indicate that you have made the
correction, delete ‘Uncorrected’.
6 To standardise the corrected raw score, look this up in the norm table
presented in Appendix IV – Table 3 and enter this in the box marked
‘Standard Score’.
You have scored and standardised the VCR2. If you wish to score and stan-
dardise the NCR2 follow steps 9-10.
7 Enter the total score obtained from step 2 in the box marked
‘Corrected/Uncorrected Raw Score’. To indicate that you have not made
the correction, delete ‘Corrected’.
8 To standardise the uncorrected raw score, look this value up in the norm
table presented in Appendix IV – Table 2 and enter this in the box
marked ‘Standard Score’.
9 Count up the number of correct responses to the NCR2 and enter the
total in the box marked ’Total’.
10 To standardise the raw score, look this value up in the norm table
presented in Appendix IV – Table 1 and enter this in the box marked
‘Standard Score’.
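The numbered steps above can be summarised in a short script. The following sketch is an unofficial illustration of the same procedure (the function names and table constants are ours); it assumes the 0.5-per-wrong-answer correction from Appendix III and the sten bands from Appendix IV:

```python
# Sten norm tables from Appendix IV: (low, high, sten) bands.
TABLE_1_NCR2 = [(0, 2, 1), (3, 3, 2), (4, 5, 3), (6, 7, 4), (8, 10, 5),
                (11, 13, 6), (14, 16, 7), (17, 18, 8), (19, 20, 9), (21, 25, 10)]
TABLE_2_VCR2_UNCORRECTED = [(0, 7, 1), (8, 10, 2), (11, 12, 3), (13, 16, 4),
                            (17, 20, 5), (21, 23, 6), (24, 27, 7), (28, 29, 8),
                            (30, 32, 9), (33, 40, 10)]
TABLE_3_VCR2_CORRECTED = []  # data not yet available (Appendix IV, Table 3)


def sten(score, bands):
    """Look a raw score up in a sten norm table (steps 6, 8 and 10)."""
    for low, high, value in bands:
        if low <= score <= high:
            return value
    raise ValueError("Score outside the norm table, or table not yet available")


def score_vcr2(num_correct, num_wrong, correct_for_guessing=True):
    """Steps 2-8: optionally apply the correction for guessing, then make
    sure the matching norm table is used for standardisation."""
    if correct_for_guessing:
        raw = max(0.0, num_correct - 0.5 * num_wrong)   # steps 3-5
        return raw, sten(raw, TABLE_3_VCR2_CORRECTED)   # step 6
    return num_correct, sten(num_correct, TABLE_2_VCR2_UNCORRECTED)  # steps 7-8


def score_ncr2(num_correct):
    """Steps 9-10: NCR2 raw scores are standardised directly against Table 1."""
    return num_correct, sten(num_correct, TABLE_1_NCR2)
```

Note that, as printed in Appendix IV, the norms for corrected VCR2 scores (Table 3) are marked as not yet available, so in this sketch only the uncorrected path can currently be standardised.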
APPENDIX III
CORRECTION FOR GUESSING
Number of incorrect answers    Correction (to be deducted from raw score)
 1      .5
 2     1
 3     1.5
 4     2
 5     2.5
 6     3
 7     3.5
 8     4
 9     4.5
10     5
11     5.5
12     6
13     6.5
14     7
15     7.5
16     8
17     8.5
18     9
19     9.5
20    10
21    10.5
22    11
23    11.5
24    12
25    12.5
26    13
27-40  Corrected raw score = 0
APPENDIX IV
NORM TABLES
Table 1 – NCR2 raw scores
Sten:  1: 0-2   2: 3   3: 4-5   4: 6-7   5: 8-10   6: 11-13   7: 14-16   8: 17-18   9: 19-20   10: 21-25

Table 2 – VCR2 raw scores (uncorrected)
Sten:  1: 0-7   2: 8-10   3: 11-12   4: 13-16   5: 17-20   6: 21-23   7: 24-27   8: 28-29   9: 30-32   10: 33-40

Table 3 – VCR2 raw scores (corrected): data not yet available