Educational policy-makers are placing increasing emphasis on testing. All this energy devoted to standardized educational assessment presents a great opportunity for improving instructional decision-making, if testing programs can provide instructionally meaningful results quickly. Traditional educational assessments usually treat the assessed domain as a single construct, and correspondingly use a unidimensional latent variable model to represent student knowledge of the target domain. However, a single global score for student achievement is likely useful only for the coarsest of educational decisions. To be most useful for teachers' instructional decisions, an educational test needs to provide detailed diagnostic information about students. While diagnostic assessment models do exist (such as Diagnostic Classification Models and, to a lesser extent, Multidimensional Item Response Theory), most tests that use these models ignore the structure of the sub-domains they diagnose, treating them either as independent or as arbitrarily correlated. However, testing efficiencies may be available if the structural relationships between sub-domains are modeled explicitly. These structural relationships include pre-requisite relationships (learning 'A' is required to learn 'B') and hierarchical relationships (both 'A' and 'B' are concepts within the larger field of 'X'). Graphical modeling (e.g., Bayesian networks and structural equation models) provides a convenient and intuitive way to represent and model these structural relationships explicitly. This dissertation investigates the use of graphical knowledge models for educational assessment in four parts. First, the existing literature on the use of Bayesian networks in educational assessment is thoroughly reviewed, identifying paths for future research.
Second, a graphical knowledge model is developed for the domain of an operational curriculum-linked assessment (exams for a university physics course in classical mechanics), providing an authentic example of model development and use. Third, recovery of network, item, and person parameters for tests based on graphical knowledge models is investigated via simulation to guide certain practical questions in field-testing and calibrating such models for educational assessments. Finally, graphical knowledge models are placed in a Computerized Adaptive Testing context, and the performance of a mutual information-based item selection criterion is investigated for item banks that provide different levels of information for different sub-domains. For the physics exam, all path models fit the item responses better than either a unidimensional model or a multidimensional model that treated all sub-domains as independent. Both an expert-derived model for sub-domain relationships and a linear model that connected sub-domains in the order in which they were taught exhibited adequate model fit, though the linear model performed slightly better than the expert-derived model. The parameter recovery study found that the precision of person and path parameters depends on where the parameter sits in the network: sub-domains with many descendants benefit from additional information provided by items for their descendants' sub-domains, particularly as the strength of the relationships increases. Sub-domains with many parents and few or no descendants may need greater numbers of items to achieve similar measurement precision. Moreover, the parameter recovery study demonstrated that network structures may be investigated with very modest test and sample sizes (3 items per sub-domain and 300 examinees), though larger samples are required for precise item and person parameters.
In the adaptive testing study, the restricted mutual information-based item selection criteria investigated successfully equalized precision across sub-domains in item banks that provided more information for some sub-domains than others, and the three alternative forms of the restricted criteria performed similarly.
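
As a rough illustration of what mutual information-based item selection on a graphical knowledge model can look like, the following minimal Python sketch uses a two-node pre-requisite network ('A' is needed for 'B') with binary mastery states. It is not the dissertation's model or its restricted criteria: the network probabilities, the guess and slip rates, the item names, and all function names are hypothetical values chosen for illustration, and the criterion shown is plain, unrestricted mutual information between an item response and the joint mastery state.

```python
import itertools
import math

# Hypothetical pre-requisite network: mastery of A makes mastery of B likely.
P_A = 0.6                          # assumed P(A mastered)
P_B_GIVEN_A = {1: 0.7, 0: 0.05}    # assumed P(B mastered | A)

# Hypothetical item bank: each item measures one sub-domain,
# with assumed guess and slip rates.
ITEMS = {
    "a1": {"skill": "A", "guess": 0.2, "slip": 0.1},
    "b1": {"skill": "B", "guess": 0.2, "slip": 0.1},
    "b2": {"skill": "B", "guess": 0.45, "slip": 0.4},
}


def prior():
    """Joint prior over (A, B) mastery states implied by the network."""
    joint = {}
    for a, b in itertools.product((0, 1), repeat=2):
        p_a = P_A if a else 1 - P_A
        p_b = P_B_GIVEN_A[a] if b else 1 - P_B_GIVEN_A[a]
        joint[(a, b)] = p_a * p_b
    return joint


def p_correct(item, state):
    """P(correct response | mastery state) under a guess/slip item model."""
    a, b = state
    mastered = a if item["skill"] == "A" else b
    return 1 - item["slip"] if mastered else item["guess"]


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(item, belief):
    """I(X; S) = H(X) - H(X | S) for response X and mastery state S."""
    p1 = sum(p * p_correct(item, s) for s, p in belief.items())
    h_x = entropy([p1, 1 - p1])
    h_x_given_s = sum(
        p * entropy([p_correct(item, s), 1 - p_correct(item, s)])
        for s, p in belief.items()
    )
    return h_x - h_x_given_s


def update(belief, item, correct):
    """Bayes update of the joint belief after observing one response."""
    post = {
        s: p * (p_correct(item, s) if correct else 1 - p_correct(item, s))
        for s, p in belief.items()
    }
    total = sum(post.values())
    return {s: p / total for s, p in post.items()}


def select_item(belief, available):
    """Pick the available item with the largest mutual information."""
    return max(available, key=lambda name: mutual_information(ITEMS[name], belief))
```

In this toy setup, a correct response to the 'A' item also raises the posterior probability that 'B' is mastered, because the belief is updated over the joint state. This is one simple way to see how items for one sub-domain can carry information about related sub-domains.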