Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves a substantial amount of work by both the statistical analysts and the test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. We describe statistical procedures that can be used to determine whether scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teacher licensing test.
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.
The use of computer-based assessments makes it possible to collect detailed data that capture examinees' progress through the tests and the time spent on individual actions. This article presents a study using process and timing data to aid understanding of an international language assessment and its examinees. Issues regarding test-taking strategies, test speededness, test design, and their relationship to examinees' demographic backgrounds and performance are also discussed.
In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.
Educational testing intends to measure a person's ability, and tests are designed for various objectives. Because the ideal pure power test and pure speed test do not meet the practical needs of test administration, speeded tests become the primary option. The tradeoff between speed and accuracy is an important issue, especially in speeded tests, and the ability to collect and analyze item response time, the amount of time an examinee devotes to a particular item until answering it correctly, provides an opportunity to investigate the relationship between response accuracy and response times. A few methods have been proposed to capture such a relationship. This dissertation addresses the issue in the context of computer-based testing (CBT), where response times can be recorded automatically. Time limits, an important component of speeded tests, are considered. The proposed mixture model incorporates both (potential) response accuracy and response times. In the spirit of speeded tests, it takes into account the possibility that some examinees are not potentially capable of solving a certain item, as in a pure power test, and decomposes response times into finite response times for those who are potentially capable and infinite response times for the others. Time limits are represented in terms of censoring: some response times are censored because the problem-solving process is terminated by the time limit. The proportion of examinees who are potentially capable of solving a particular item is modeled by a regular item response theory (IRT) model. The parameter estimation method is first validated in extensive simulation studies, and the results of a real data analysis indicate that the proposed mixture model outperforms methods that ignore the impact of time limits.
Journal of Educational and Behavioral Statistics, Mar 8, 2021
In many educational assessments, items are reused in different administrations throughout the life of the assessment. Ideally, a reused item should perform relatively similarly over time. In reality, an item may become easier with exposure, especially when item preknowledge has occurred. This article presents a novel cumulative sum procedure for detecting item preknowledge in continuous testing, where data for each reused item may come from small and varying sample sizes across administrations. Its performance is evaluated with simulations and analytical work. The approach is effective in detecting item preknowledge quickly with a group size of at least 10 and is easy to implement with varying item parameters. In addition, it is robust to the ability estimation error introduced in the simulations.
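The cumulative-sum idea in this abstract can be illustrated with a minimal sketch. This is not the authors' exact procedure: the one-sided CUSUM statistic, the reference value `k`, and the threshold `h` are illustrative assumptions, and each administration's observed proportion correct is compared with a model-based expectation.

```python
# Hypothetical one-sided CUSUM sketch for detecting drift in a reused item's
# difficulty across administrations (not the article's exact procedure).
import math

def cusum_item_drift(n_correct, n_taken, p_expected, k=0.5, h=4.0):
    """Track S_t = max(0, S_{t-1} + z_t - k) over administrations and
    flag possible preknowledge once S_t exceeds the threshold h."""
    s, flags = 0.0, []
    for c, n, p in zip(n_correct, n_taken, p_expected):
        se = math.sqrt(p * (1 - p) / n)   # binomial standard error
        z = (c / n - p) / se              # standardized residual for this admin
        s = max(0.0, s + z - k)           # one-sided CUSUM update
        flags.append(s > h)
    return flags
```

Because the statistic accumulates only upward departures from the expected proportion correct, it tolerates small samples per administration: no single administration need be decisive, but a sustained run of easier-than-expected performance trips the threshold.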
For assessments that use different forms in different administrations, equating methods are applied to ensure the comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some expected and others unforeseen. The situation is more challenging for assessments that assemble many different forms and deliver frequent administrations each year. Harmonic regression, a seasonal-adjustment method, has been found useful in differentiating between possible known sources of variability and unknown sources so as to study score stability for such assessments. As an extension, this paper presents a family of three approaches that incorporate examinees' demographic data into harmonic regression in different ways. A generic evaluation method based on jackknifing is developed to compare the approaches within the family. The three approaches are compared using real data from an international language assessment. Results suggest that all approaches perform similarly and are effective in meeting the goal. The paper also discusses the properties and limitations of the three approaches, along with inferences about score (in)stability based on the harmonic regression results.
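The core of harmonic regression can be sketched in a few lines. The specification below, a linear trend plus a single annual sine/cosine pair, is an assumed minimal form; the paper's actual model and its demographic extensions are not reproduced here.

```python
# Minimal harmonic-regression sketch (assumed specification): mean score per
# administration regressed on a linear trend plus first-order seasonal
# harmonics with an annual period. Residuals are the deseasonalized series.
import numpy as np

def harmonic_fit(t, y, period=12.0):
    """Least-squares fit of y ~ 1 + t + sin(2*pi*t/period) + cos(2*pi*t/period)."""
    w = 2 * np.pi * t / period
    X = np.column_stack([np.ones_like(t), t, np.sin(w), np.cos(w)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta   # coefficients and residual (deseasonalized) series
```

The fitted harmonics absorb the known seasonal component of score means; what remains in the residuals is then examined for unexplained drift, which is the score-stability question the abstract describes.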
Many large-scale standardized tests are intended to measure skills related to ability rather than the rate at which examinees can work. Time limits imposed on these tests make it difficult to distinguish between the effect of low proficiency and the effect of lack of time. This paper proposes a mixture cure-rate model approach to address this issue. Maximum likelihood estimation is proposed for parameter and variance estimation for three cases: when examinee parameters are to be estimated given precalibrated item parameters, when item parameters are to be calibrated given known examinee parameters, and when item parameters are to be estimated without assuming known examinee parameters. Large-sample properties are established for the cases under suitable regularity conditions. Simulation studies suggest that the proposed approach is appropriate for inferences concerning model parameters. In addition, not distinguishing between the effect of low proficiency and the effect of lack of time is shown to have considerable consequences for parameter estimation. A real data example is presented to demonstrate the new model. Choice of survival models for the latent power times is also discussed.
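The mixture cure-rate idea can be sketched as a per-response log-likelihood. The functional forms below, a 2PL curve for the probability of being capable of the item and an exponential latent solution time, are illustrative assumptions, not the paper's exact model:

```python
# Illustrative log-likelihood for one response under a mixture cure-rate model
# (assumed forms, not the paper's exact specification): with probability p
# (a 2PL IRT curve) the examinee can solve the item and the latent solution
# time is exponential; unsolved responses are right-censored at the time limit.
import math

def item_loglik(theta, a, b, rate, t, solved, limit):
    """theta: ability; a, b: 2PL item parameters; rate: hazard of the
    exponential latent power time; t: observed time; solved: answered
    correctly within the limit; limit: the time limit (censoring point)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # 'susceptible' probability
    if solved:
        # capable AND solution-time density at t (t <= limit)
        return math.log(p) + math.log(rate) - rate * t
    # censored: either not capable, or capable but not finished by the limit
    return math.log((1 - p) + p * math.exp(-rate * limit))
```

The censored branch is what separates low proficiency from lack of time: an unsolved item contributes mass from both "could never solve it" and "could solve it but ran out of time," which is exactly the confound the abstract says plain IRT models ignore.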
A multistage test (MST) is a computer-based assessment that may be thought of as a compromise between a linear test and a computer-adaptive test (CAT). As such, MSTs may be vulnerable to at least some of the major security threats associated with each of these types of test (e.g., copying for linear tests and item preknowledge for CATs). The degree of vulnerability of any particular MST to these threats, as well as others, will depend (among other things) on details of the MST assembly and administration design. To supplement these preventative measures, routine statistical monitoring of response and timing data for items, modules, and tests, as well as screening of the performance of individual test takers and clusters of test takers, is essential. We strongly believe that test security procedures are properly understood in terms of quality control for a testing program, and that the goal of these procedures should be for the program to report only valid test scores, while tr...
The method of maximum likelihood is typically applied to item response theory (IRT) models when the ability parameter is estimated while conditioning on the true item parameters. In practice, the item parameters are unknown and must first be estimated from a calibration sample. Lewis (1985) and Zhang and Lu (2007) proposed expected response functions (ERFs) and the corrected weighted-likelihood estimator (CWLE), respectively, to take into account the uncertainty regarding item parameters for purposes of ability estimation. In this paper, we investigate the performance of ERFs and of the CWLE in different situations, such as various test lengths and levels of measurement error in item parameter estimation. Our empirical results indicate that ERFs can keep the bias in ability estimation within [−0.2, 0.2] under all conditions, whereas the CWLE can effectively reduce the bias in ability estimation provided that it has a good starting point.
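An expected response function can be sketched as the average of an ordinary item response function over draws that represent item-parameter uncertainty. The normal calibration-error assumption and the `erf_2pl` helper below are illustrative, not Lewis's (1985) exact construction:

```python
# Sketch of an expected response function (ERF): average a 2PL item response
# function over Monte Carlo draws of the item parameters, with assumed normal
# calibration error around the estimates (illustrative, not Lewis's exact form).
import math
import random

def p_2pl(theta, a, b):
    """Standard 2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def erf_2pl(theta, a_hat, b_hat, se_a, se_b, n_draws=2000, seed=0):
    """Monte Carlo ERF: E[P(theta; a, b)] over parameter uncertainty."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        a = max(1e-3, rng.gauss(a_hat, se_a))  # keep discrimination positive
        b = rng.gauss(b_hat, se_b)
        total += p_2pl(theta, a, b)
    return total / n_draws
```

Using the ERF in place of the plug-in response function flattens the curve where the item parameters are poorly estimated, which is how parameter uncertainty propagates into ability estimation.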
Providing information to test takers and test score users about the abilities of test takers at d... more Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement. Scale anchoring, a technique which describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves a substantial amount of work, both by the statistical analysts and test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. We describe statistical procedures that can be used to determine if scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, then there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teachers' licensing test.
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its pr... more Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.
Providing information to test takers and test score users about the abilities of test takers at d... more Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves substantial amount of work, both by the statistical analysts and test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. This paper describes statistical procedures that can be used to determine if scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, then there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teacher licensing test.
The use of computer-based assessments makes the collection of detailed data that capture examinee... more The use of computer-based assessments makes the collection of detailed data that capture examinees’ progress in the tests and time spent on individual actions possible. This article presents a study using process and timing data to aid understanding of an international language assessment and the examinees. Issues regarding test-taking strategies, test speededness, test design, and their relationship to examinees’ demographic backgrounds and performance are also discussed.
Providing information to test takers and test score users about the abilities of test takers at d... more Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that describes what students at different points on a score scale know and can do, is a tool to provide such information. Scale anchoring for a test involves substantial amount of work, both by the statistical analysts and test developers involved with the test. In addition, scale anchoring involves considerable use of subjective judgment, so its conclusions may be questionable. This paper describes statistical procedures that can be used to determine if scale anchoring is likely to be successful for a test. If these procedures indicate that scale anchoring is unlikely to be successful, then there is little reason to perform a detailed scale anchoring study. The procedures are applied to several data sets from a teacher licensing test.
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its pr... more Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.
In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are u... more In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its pr... more Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.
Educational testing intends to measure person's ability, and tests are designed for various o... more Educational testing intends to measure person's ability, and tests are designed for various objectives. While the ideal pure power tests and pure speed tests do not meet the practical needs in test administration, speeded tests become the primary option. The tradeoff between speed and accuracy is an important issue, especially in speeded tests, and being able to collect and analyze item response time, which is the amount of time an examinee devotes to a particular item until answering it correctly, provides an opportunity to investigate the relationship between response accuracy and response times. A few methods have been proposed to capture such relationship. This dissertation addresses the issue in the context of computer-based testing (CBT), where response times can be automatically recorded. An important component, time limits, in speeded tests is considered. The proposed mixture model incorporates both (potential) response accuracy and response times. In the spirit of speeded tests, it takes into account the possibility that some examinees are not potentially capable of solving a certain item, as in a pure power test, and decomposes response times into finite response times for those who are potentially capable and infinite response times for others. Time limits are represented in terms of censoring: some response times are censored because the problem solving procedures are terminated due to time limits. The proportion of examinees who are potentially capable of solving a particular item is modeled by a regular Item Response Theory (IRT) model. The parameter estimation method is first validated in extensive simulation studies, and the results of real data analysis indicate that the proposed mixture model outperforms the methods in which the impact of time limits is ignored.
Journal of Educational and Behavioral Statistics, Mar 8, 2021
In many educational assessments, items are reused in different administrations throughout the lif... more In many educational assessments, items are reused in different administrations throughout the life of the assessments. Ideally, a reused item should perform relatively similarly over time. In reality, an item may become easier with exposure, especially when item preknowledge has occurred. This article presents a novel cumulative sum procedure for detecting item preknowledge in continuous testing where data for each reused item may be obtained from small and varying sample sizes across administrations. Its performance is evaluated with simulations and analytical work. The approach is effective in detecting item preknowledge quickly with group size at least 10 and is easy to implement with varying item parameters. In addition, it is robust to the ability estimation error introduced in the simulations.
For assessments that use different forms in different administrations, equating methods are appli... more For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be unforeseen. The situation is more challenging for assessments that assemble many different forms and deliver frequent administrations per year. Harmonic regression, a seasonal-adjustment method, has been found useful in achieving the goal of differentiating between possible known sources of variability and unknown sources so as to study score stability for such assessments. As an extension, this paper presents a family of three approaches that incorporate examinees' demographic data into harmonic regression in different ways. A generic evaluation method based on jackknifing is developed to compare the approaches within the family. The three approaches are compared using real data from an international language assessment. Results suggest that all approaches perform similarly and are effective in meeting the goal. The paper also discusses the properties and limitations of the three approaches, along with inferences about score (in)stability based on the harmonic regression results.
Many large-scale standardized tests are intended to measure skills related to ability rather than the rate at which examinees can work. Time limits imposed on these tests make it difficult to distinguish between the effect of low proficiency and the effect of lack of time. This paper proposes a mixture cure-rate model approach to address this issue. Maximum likelihood estimation is proposed for parameter and variance estimation for three cases: when examinee parameters are to be estimated given precalibrated item parameters, when item parameters are to be calibrated given known examinee parameters, and when item parameters are to be estimated without assuming known examinee parameters. Large-sample properties are established for the cases under suitable regularity conditions. Simulation studies suggest that the proposed approach is appropriate for inferences concerning model parameters. In addition, not distinguishing between the effect of low proficiency and the effect of lack of time is shown to have considerable consequences for parameter estimation. A real data example is presented to demonstrate the new model. Choice of survival models for the latent power times is also discussed.
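The cure-rate idea is that an examinee either can never solve the item (the "cured" fraction, here lack of proficiency) or would solve it after a latent "power time" that may be censored by the time limit. A minimal likelihood sketch, assuming an exponential power-time distribution (the paper discusses the choice of survival model; this function and its parameterization are assumptions for illustration):

```python
import math

def cure_neg_loglik(pi, rate, times, events):
    """Negative log-likelihood for a simple mixture cure-rate model.

    Illustrative sketch, not the paper's full model: with probability
    pi an examinee can solve the item, with latent power time
    T ~ Exponential(rate); with probability 1 - pi the item is never
    solved. Each observation is a response time with event = 1
    (solved) or event = 0 (censored at the time limit, where lack of
    proficiency and lack of time are indistinguishable).
    """
    nll = 0.0
    for t, d in zip(times, events):
        if d:   # solved at time t: contribution pi * rate * exp(-rate * t)
            nll -= math.log(pi) + math.log(rate) - rate * t
        else:   # censored: contribution (1 - pi) + pi * exp(-rate * t)
            nll -= math.log((1 - pi) + pi * math.exp(-rate * t))
    return nll
```

Minimizing this over (pi, rate) with a numerical optimizer separates the proficiency mixture weight from the speed parameter — the distinction the abstract argues is consequential to ignore.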
A multistage test (MST) is a computer-based assessment that may be thought of as a compromise between a linear test and a computer-adaptive test (CAT). As such, MSTs may be vulnerable to at least some of the major security threats associated with each of these types of test (e.g., copying for linear tests and item preknowledge for CATs). The degree of vulnerability of any particular MST to these threats, as well as others, will depend (among other things) on details of the MST assembly and administration design. To supplement these preventative measures, routine statistical monitoring of response and timing data for items, modules, and tests, as well as the screening of performance of individual test takers and clusters of test takers, is essential. We strongly believe that test security procedures are properly understood in terms of quality control for a testing program, and that the goal of these procedures should be for the program to report only valid test scores, while tr…
In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.
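Minimum discriminant information adjustment reweights a reference sample so that selected moments match target values while staying as close as possible (in Kullback–Leibler divergence) to the original weights, which leads to exponentially tilted weights. The simplest one-constraint case — matching a single weighted mean, solved by bisection — can be sketched as follows; the function name and solver details are assumptions for illustration, not the report's implementation.

```python
import math

def mdia_weights(x, target_mean, tol=1e-10):
    """Exponentially tilt uniform weights on sample x so the weighted
    mean equals target_mean: w_i proportional to exp(theta * x_i).

    This is the simplest one-moment case of the exponential-tilting
    solution behind minimum discriminant information adjustment;
    theta is found by bisection (the weighted mean is increasing in
    theta). target_mean must lie strictly between min(x) and max(x).
    """
    def weighted_mean(theta):
        w = [math.exp(theta * xi) for xi in x]
        s = sum(w)
        return sum(wi * xi for wi, xi in zip(w, x)) / s

    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = [math.exp(theta * xi) for xi in x]
    s = sum(w)
    return [wi / s for wi in w]
```

In the linking application, the constraints would involve MC-section score moments rather than a single mean, but the reweighting principle is the same.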
Computer-based assessments make it possible to collect detailed data that capture examinees' progress through a test and the time spent on individual actions. This article presents a study using process and timing data to aid understanding of an international language assessment and its examinees. Issues regarding test-taking strategies, test speededness, and test design, and their relationship to examinees' demographic backgrounds and performance, are also discussed.
The method of maximum likelihood is typically applied to item response theory (IRT) models when the ability parameter is estimated while conditioning on the true item parameters. In practice, the item parameters are unknown and need to be estimated first from a calibration sample. Lewis (1985) and Zhang and Lu (2007) proposed the expected response functions (ERFs) and the corrected weighted-likelihood estimator (CWLE), respectively, to take into account the uncertainty regarding item parameters for purposes of ability estimation. In this paper, we investigate the performance of ERFs and of the CWLE in different situations, such as various test lengths and levels of measurement error in item parameter estimation. Our empirical results indicate that ERFs can cause the bias in ability estimation to fall within [−0.2, 0.2] for all conditions, whereas the CWLE can effectively reduce the bias in ability estimation provided that it has a good foundation to start from.
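The ERF idea is to replace the item characteristic curve evaluated at a single point estimate of the item parameters with its average over the (posterior or sampling) distribution of those parameters. A minimal sketch for a 2PL item, averaging over a set of parameter draws (the function names are assumptions, and this is the idea behind ERFs rather than a production implementation):

```python
import math

def p2pl(theta, a, b):
    """2PL item characteristic curve: P(correct | theta, a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_response_function(theta, param_draws):
    """Expected response function: average the 2PL curve over draws
    (a, b) of the item parameters, so calibration uncertainty is
    reflected in the response probability instead of being ignored
    by plugging in a single point estimate."""
    return sum(p2pl(theta, a, b) for a, b in param_draws) / len(param_draws)
```

With a degenerate set of draws the ERF reduces to the plug-in curve; with dispersed draws it is flatter, which is what tempers the bias in the resulting ability estimates.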
Papers by Yi-Hsuan Lee