A desirable goal would be to develop a methodology for scoring essays so that the final grades are less affected by when or by whom each essay was read. It seems sensible to derive such grades by somehow adjusting the ratings originally given by each reader. This essay describes a solution that relies on statistical adjustment, using the context of the College Board's Advanced Placement program. Nonstatistical provisions, such as rater training, are in place to minimize the potential impact of rater differences on grades, but there is no simple way of getting a true score for an essay. The basic idea in using statistical thinking to help is to reduce the effect on scoring reliability of some of the sources of variability through calibrating readers and days on which essays are read. Estimating the relative stringency of raters and the scoring trends across time is made possible by the choice of experimental design developed by statisticians. An example illustrates the approach. Calibration experiments on five different Advanced Placement examinations showed that, in general, calibrated scores enhance reliability, but there are obstacles to overcome before the approach can be operationalized with actual essays. (Contains three tables and three references.) (SLD)
National Center for Education Statistics, Aug 1, 2006
The National Center for Education Statistics (NCES) is the primary federal entity for collecting, analyzing, and reporting data related to education in the United States and other nations. It fulfills a congressional mandate to collect, collate, analyze, and report full and complete statistics on the condition of education in the United States; conduct and publish reports and specialized analyses of the meaning and significance of such statistics; assist state and local education agencies in improving their statistical systems; and review and report on education activities in foreign countries. NCES activities are designed to address high priority education data needs; provide consistent, reliable, complete, and accurate indicators of education status and trends; and report timely, useful, and high quality data to the U.S. Department of Education, the Congress, the states, other education policymakers, practitioners, data users, and the general public. Unless specifically noted, all information contained herein is in the public domain. We strive to make our products available in a variety of formats and in language that is appropriate to a variety of audiences. You, as our customer, are the best judge of our success in communicating information effectively. If you have any comments or suggestions about this or any other NCES product or report, we would like to hear from you. Please direct your comments to
Enhancing students' critical thinking (CT) skills is an essential goal of higher education. This article presents a systematic approach to conceptualizing and measuring CT. CT generally comprises the following mental processes: identifying, evaluating, and analyzing a problem; interpreting information; synthesizing evidence; and reporting a conclusion. We further posit that CT also involves dealing with dilemmas involving ambiguity or conflicts among principles and contradictory information. We argue that performance assessment provides the most realistic, and most credible, approach to measuring CT. From this conceptualization and construct definition, we describe one possible framework for building performance assessments of CT with attention to extended performance tasks within the assessment system. The framework is a product of an ongoing, collaborative effort, the International Performance Assessment of Learning (iPAL). The framework comprises four main aspects: (1) The storyline describes a carefully curated version of a complex, real-world situation. (2) The challenge frames the task to be accomplished. (3) A portfolio of documents in a range of formats is drawn from multiple sources chosen to have specific characteristics. (4) The scoring rubric comprises a set of scales each linked to a facet of the construct. We discuss a number of use cases, as well as the challenges that arise with the use and valid interpretation of performance assessments. The final section presents elements of the iPAL research program that involve various refinements and extensions of the assessment framework, a number of empirical studies, along with linkages to current work in online reading and information processing.
Psychologist Andrea diSessa coined the term "phenomenological primitives", or p-prims, to talk about nonexperts' reasoning about physical situations. P-prims are primitive in the sense that they stand without significant explanatory substructure or explanation. Examples are "Heavy objects fall faster than light objects" and "Continuing force is needed for continuing motion." P-prims are based on experience. They are not a coherent system; they may even contradict one another. People assemble from them a model of sorts to reason about a given situation. Intuitive physics is wrong from a physicist's point of view, but it works just fine to play fetch with your dog or push a couch across the room. It fails when you want to build a skyscraper or send a rocket to the moon. This paper considers p-prims that underlie reasoning about assessment, the basis of what one might call intuitive test theory. Examples are "A test measures what it says at the top of the page," and "Scores from any two tests can be made interchangeable, with a little equating magic." Testing p-prims underlie discussions of test theory in the classroom, in the news, and in policy-making. Again, intuitive test theory works reasonably well for everyday uses like Friday's math quiz. It fails when you want to design an adaptive test, or measure the change in the proportion of students reading Above Basic from a matrix-sampled assessment such as NAEP.
This paper provides an historical overview of the philosophical, theoretical, and practical contributions made by John Tukey to the field of simultaneous inference. The Problem of Multiple Comparisons, released in 1953, provided not only the first comprehensive account of the field but also set much of the research agenda for the next 35 years. During the last decade of his life, Tukey devoted substantial attention to this area, experimenting with different graphical representations of multiple comparison procedures and exploring the implications of the false discovery rate (FDR) approach to controlling family-wise error rates. In a number of publications, Tukey continued to grapple with the fundamental issues of the field and to identify critical problems to be addressed.
Publisher Summary A value-added model (VAM) refers to a family of statistical models that are employed to make inferences about the effectiveness of educational units, usually schools and/or teachers. They are characterized by their focus on patterns in student score gains over time, rather than on student status at one point of time. In particular, they attempt to extract the estimates of the contributions of schools or teachers to student learning from the data on score trajectories. Before turning to the specifics of different VAMs, the chapter addresses a number of general issues that pertain to all such models. Each issue addresses a particular aspect of the use of VAMs in educational settings and leads to caution required in the interpretation and use of VAM results. The chapter presents two VAMs currently used operationally and two VAMs that have appeared in the research literature. It also discusses the future of this technology and policy implications of its use.
Journal of Educational and Behavioral Statistics, 1993
We present a novel approach to the empirical Bayes analysis of aggregated survival data from different groups of subjects. The method is based on a contingency table representation of the data and employs transformations to permit the use of normal priors. In contrast to the case of a single survival curve, the empirical Bayes analysis of families of such curves leads to estimates which offer a qualitative improvement over classical estimates based on the ratio of occurrence to exposure rates. This method is illustrated with data on the attainment of the doctoral degree from three major universities.
Employing nested sequences of models is a common practice when exploring the extent to which one set of variables mediates the impact of another set. Such an analysis in the context of logistic regression models confronts two challenges: (i) direct comparisons of coefficients across models are generally biased due to the changes in scale that accompany the changes in the set of explanatory variables, (ii) conducting a large number of tests induces a problem of multiplicity that can lead to spurious findings of significance if not heeded. This article aims to illustrate a practical strategy for conducting analyses in the face of these challenges. The challenges, and how to address them, are illustrated using a subset of the findings reported by Braun (Large-scale Assess Educ 6(4):1–52, 2018. 10.1186/s40536-018-0058-x), drawn from the Programme for the International Assessment of Adult Competencies (PIAAC), an international, large-scale assessment of adults. For each country in the data...
International Journal of Educational Methodology, 2021
This article introduces the concept of the carrying capacity of data (CCD), defined as an integrated, evaluative judgment of the credibility of specific data-based inferences, informed by quantitative and qualitative analyses, leavened by experience. The sequential process of evaluating the CCD is represented schematically by a framework that can guide data analysis and statistical inference, as well as pedagogy. Aspects of each phase are illustrated with examples. A key initial activity in empirical work is data scrutiny, comprising consideration of data provenance and characteristics, as well as data limitations in light of the context and purpose of the study. Relevant auxiliary information can contribute to evaluating the CCD, as can sensitivity analyses conducted at the modeling stage. It is argued that early courses in statistical methods, and the textbooks they rely on, typically give little emphasis to, or omit entirely, discussion of the importance of data scrutiny in scie...
Teachers College Record: The Voice of Scholarship in Education, 2011
Background/Context: The National Assessment of Educational Progress (NAEP) is the only comparative assessment of academic competencies regularly administered to nationally representative samples of students enrolled in Grades 4, 8, and 12. Because NAEP is a low-stakes assessment, there are long-standing questions about the level of engagement and effort of the 12th graders who participate in the assessment and, consequently, about the validity of the reported results. Purpose/Focus: This study investigated the effects of monetary incentives on the performance of 12th graders on a reading assessment closely modeled on the NAEP reading test in order to evaluate the likelihood that scores obtained at regular administrations underestimate student capabilities. Population: The study assessed more than 2,600 students in a convenience sample of 59 schools in seven states. The schools are heterogeneous with respect to demographics and type of location. Intervention: There were three conditions:...
International Journal of Educational Methodology, 2021
Purpose in life is a key construct in the development of young adults, particularly college students. There are many instruments measuring sense of purpose in life, but few studies have examined their measurement properties among college students. The current study compares the measurement invariance properties of the Purpose in Life (PIL) scale and the Claremont Purpose Scale (CPS) across college year and undergraduate school. Using both a unidimensional and a two-dimensional model, we found that the PIL’s interpretability is limited among college students. Using a three-dimensional model, the CPS was invariant with respect to both grouping variables. The study suggests that the CPS can be used to make meaningful comparisons among college students categorized by school year and undergraduate school. The study also has some implications about the construct of purpose in life; namely, scale structures that work well statistically and theoretically among adults might not generalize to...
Papers by Henry Braun