This study applies evolutionary algorithm-based (EA-based) symbolic regression to assess how well metacognitive strategy use, as measured by the Metacognitive Awareness Listening Questionnaire (MALQ), and lexico-grammatical knowledge predict listening comprehension proficiency among English learners. First, the psychometric validity of the MALQ subscales, the lexico-grammatical test, and the listening test was examined using the logistic Rasch model and the Rasch-Andrich rating scale model. Next, linear regression found both sets of predictors to have weak or inconclusive effects on listening comprehension; however, EA-based symbolic regression suggested that lexico-grammatical knowledge and two of the five metacognitive strategies tested strongly and nonlinearly predicted listening proficiency (R² = .64). Constraining prediction modeling to linear relationships is argued to jeopardize the validity of language assessment studies, potentially leading them to inaccurately contradict otherwise well-established language assessment hypotheses and theories.
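The contrast between linear and nonlinear prediction can be illustrated with a small synthetic sketch. Everything below is hypothetical, not taken from the study: two centered predictors stand in for the lexico-grammatical and metacognitive scores, and the true effect is a pure interaction, which a linear model barely detects but an interaction model recovers.

```python
import numpy as np

# Hypothetical illustration: when the true relationship is an interaction
# between two centered predictors, a linear model sees almost nothing,
# while a model containing the interaction term recovers it.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-1, 1, n)                # stand-in for lexico-grammatical score
x2 = rng.uniform(-1, 1, n)                # stand-in for a metacognitive strategy score
y = x1 * x2 + rng.normal(0, 0.05, n)      # purely multiplicative (nonlinear) effect

def r_squared(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Linear model: y ~ b0 + b1*x1 + b2*x2
X_lin = np.column_stack([np.ones(n), x1, x2])
b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
r2_linear = r_squared(y, X_lin @ b_lin)

# A symbolic-regression search over arithmetic expressions could discover
# the x1*x2 term; here we add it by hand to show the fit it would achieve.
X_int = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)
r2_nonlinear = r_squared(y, X_int @ b_int)
```

On this synthetic data the linear R² is near zero while the interaction model's R² is near one, mirroring the kind of gap the abstract describes between the two modeling approaches.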
Popular Measurement. The subjective measurement of small audible differences in audio engineering has been hampered by experimental conflicts between applicability and reproducibility. The Rasch Model offers a powerful means of controlling the statistical analysis of experimental data so as to maximize reproducibility and applicability across listeners, audio material, and devices under test. The authors describe their testing of five perceptual audio coders for Lucent Technologies. Measurement of listener perception of small audible impairments caused by audio reproduction devices has been constrained by the combined but conflicting needs for (a) reproducible test results and (b) broadly applicable conclusions. Measurement techniques have sought to achieve reproducibility through rigorous test design and execution intended to minimize such sources of uncontrolled variance as listener trainin...
Pensamiento Educativo: Revista de Investigación Educacional Latinoamericana, 2017
The International Objective Measurement Workshop (IOMW) is a biennial conference from which the four articles in this special issue of PEL were drawn, based on presentations given in Washington, DC, in April 2016. IOMW has always fostered an interest in the philosophy and possibilities of what is called "objective measurement", or "invariance", specifically as implemented by the Rasch model. Today that interest is as intense as it was in the 1980s, when the conference began. By way of informal introduction, it may be useful to review what "objectivity" means, how it is rooted in the physical sciences, and why the authors of these papers consider it an important property to have. What does it mean to perform a quantitative analysis of a dataset? The answer varies considerably across fields. In statistics, quantitative analysis aims to provide a mathematical description of a dataset, with emphasis on deciding whether observed numerical differences are "significant", meaning unlikely to have occurred by chance. This involves computing means, standard deviations, standard errors, and related statistics, which is more or less the approach taken by "classical test theory". In the 1950s, reflecting on a language-aptitude dataset, the Danish mathematician Georg Rasch realized that a statistical description of how students performed on a particular test was not what he wanted. Suppose students receive different types of tests with different items. Suppose the test types change every year. Suppose we want to compare students across levels. Suppose data are missing, and not at random.
Under these circumstances, a statistical description of performance on a particular test is not enough to compare students across tests in any generalizable way. Moreover, Rasch realized that he wanted to speak clearly about individual students, not about the population as a whole, and he did not want to use statistics that depended on the performance of other students ("grading on the bell curve") or on the items they happened to be assigned, which seems plainly unfair. In short, Rasch wanted a way to measure student ability that was as simple, reproducible, and fair as measuring a student's height with a wooden stick, or measuring physical quantities such as force and mass with a spring scale or a balance.
To an increasing degree, psychometric applications (e.g., predicting music preferences) are characterized by highly multidimensional, incomplete datasets. While the data mining and machine learning fields offer effective algorithms for such data, few of those algorithms specify Rasch-like conditions of objectivity. On the other hand, while Rasch models specify conditions of objectivity, made necessary by the imperative of fairness in educational testing, they do not decisively extend those conditions to multidimensional spaces. This paper asks the following questions: What must a multidimensional psychometric model do in order to be classified as "objective" in Rasch's sense? What algorithm can meet these requirements? The paper describes a form of "alternating least squares" matrix decomposition (NOUS) that meets these requirements to a large degree. It shows that when certain well-defined empirical criteria are met, such as fit to the model, ability to predict "pseudo-missing" cells, and structural inv...
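The "pseudo-missing cells" criterion can be made concrete with a generic alternating least squares (ALS) sketch of the family of matrix decompositions NOUS belongs to; this is not the NOUS algorithm itself, and all sizes and values below are synthetic. We hide a random subset of cells, factor the matrix from the observed cells only, and check how well the hidden cells are predicted.

```python
import numpy as np

# Generic ALS on an incomplete person-by-item matrix (a sketch, not NOUS):
# hide ~20% of cells, fit latent factors on the rest, predict the hidden cells.
rng = np.random.default_rng(1)
n_persons, n_items, k = 40, 12, 2

P_true = rng.normal(size=(n_persons, k))       # latent person coordinates
Q_true = rng.normal(size=(n_items, k))         # latent item coordinates
X = P_true @ Q_true.T                          # complete "true" data matrix
observed = rng.uniform(size=X.shape) < 0.8     # ~80% of cells treated as observed
hidden = ~observed                             # "pseudo-missing" cells to predict

P = rng.normal(scale=0.1, size=(n_persons, k))
Q = rng.normal(scale=0.1, size=(n_items, k))
lam = 1e-3                                     # small ridge term for stability
for _ in range(50):
    for p in range(n_persons):                 # solve for each person row...
        o = observed[p]
        P[p] = np.linalg.solve(Q[o].T @ Q[o] + lam * np.eye(k), Q[o].T @ X[p, o])
    for i in range(n_items):                   # ...then for each item column
        o = observed[:, i]
        Q[i] = np.linalg.solve(P[o].T @ P[o] + lam * np.eye(k), P[o].T @ X[o, i])

rmse_hidden = np.sqrt(np.mean((P @ Q.T - X)[hidden] ** 2))
```

Because the simulated data are exactly rank 2 and noiseless, the hidden-cell RMSE converges near zero; with real psychometric data, the size of this error is one empirical check on whether the decomposition supports objective, sample-free claims.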
In 1960 Georg Rasch helped open the field of Item Response Theory with the model that bears his name, distinguished by its use of a single parameter to model the relationship between item difficulty and person ability. Various extensions of this relatively simple model have been proposed since then and are regularly applied in assessments. By including additional parameters, for example to model variation in item discriminations (2-PL) or in guessing probabilities (3-PL) (Birnbaum, 1968), these extensions model the observed data more exactly and in principle improve the fit of the response probabilities used to calculate test scores. However, the gain in model fit (and arguably in reliability for particular item types) has a cost: not only are these models more complex, but the resulting test scores are also more difficult to interpret. In the U.S., various stakeholders, including courts and states, have adopted the Rasch model in part because it leads to lo...
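The three models mentioned differ only in how many item parameters enter the logistic response function, which a minimal sketch makes explicit (the parameter values are illustrative, not taken from any assessment):

```python
import math

def rasch(theta, b):
    """1-PL (Rasch): P(correct) depends only on ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def two_pl(theta, b, a):
    """2-PL adds an item discrimination parameter a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def three_pl(theta, b, a, c):
    """3-PL adds a lower asymptote c (the guessing probability)."""
    return c + (1.0 - c) * two_pl(theta, b, a)

# Where ability equals difficulty, the Rasch probability is exactly 0.5,
# regardless of the item; the 3-PL probability is pulled above 0.5 by the
# guessing floor, which is part of why its scores are harder to interpret.
p1 = rasch(0.0, 0.0)                  # 0.5
p3 = three_pl(0.0, 0.0, 1.2, 0.2)     # 0.2 + 0.8 * 0.5 = 0.6
```

In the Rasch model the raw score is a sufficient statistic for ability, so equal raw scores imply equal measures; once a or c varies by item, that simple correspondence is lost.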
Form equating methods have proceeded under the assumption that test forms should be unidimensional, both across forms and within each form. This assumption is necessary when the data are fit to a unidimensional model, such as the Rasch model. When the assumption is violated, variations in the dimensional mix of the items on each test form, as well as in the mix of skills in the student population, can lead to problematic testing anomalies. The assumption ceases to be necessary, however, when the data are fit to an appropriate multidimensional model. In that case it becomes possible to reproduce the same composite dimension rigorously across multiple test forms, even when the relative mix of dimensions embodied in the items on each form varies substantially. This paper applies one such multidimensional model, NOUS, to a simulated multidimensional dataset and shows how it avoids the pitfalls that can arise when fitting the same data to a single dimension. Some implications of equating multid...
While the emergence of Rasch and related IRT methodologies has made it routine to update tests across administrations without altering the original Pass/Fail standard, their insistence on unidimensionality raises a problem when the standard combines performance on multiple dimensions, such as mathematics and language. How should a student's mathematics and language measures be combined into a Pass/Fail decision on composite ability when the two scales embody different dimensions and logit units? Using client-determined weights and student expected scores, we review existing methods for combining unrelated subscales, encountered on a recent high-stakes certification exam, to produce composite logit measures without sacrificing the advantages of unidimensional IRT methodologies.
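One approach of the general kind reviewed can be sketched as follows; the weights, item difficulties, and cut abilities below are all hypothetical, and this is not presented as the paper's specific method. Each subscale's logit measure is converted to an expected proportion-correct score, putting both on a common 0-1 metric, after which client-determined weights and a cut value apply.

```python
import math

def expected_score(theta, difficulties):
    """Expected raw score over a set of Rasch items at ability theta."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

# Hypothetical item difficulties (in each subscale's own logit units) and
# hypothetical client-determined weights; values are illustrative only.
math_items = [-1.0, 0.0, 0.5, 1.5]
lang_items = [-0.5, 0.2, 0.8]
w_math, w_lang = 0.6, 0.4

def composite(theta_math, theta_lang):
    # Expected proportion-correct scores are unit-free, so subscales with
    # different logit metrics can be weighted and summed meaningfully.
    em = expected_score(theta_math, math_items) / len(math_items)
    el = expected_score(theta_lang, lang_items) / len(lang_items)
    return w_math * em + w_lang * el

cut = composite(0.5, 0.5)             # composite value at assumed cut abilities
passed = composite(1.2, 0.9) >= cut   # a candidate above both cut abilities passes
```

Because expected scores are monotonic in ability, a candidate above the cut ability on both subscales always clears the composite cut; the interesting trade-offs the abstract alludes to arise for candidates above the cut on one dimension and below it on the other.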
Papers by Mark Moulton