
55 Data Collection and Analysis

Tamara van Gog and Fred Paas,* Open University of the Netherlands, Heerlen, the Netherlands
Wilhelmina Savenye, Arizona State University-Tempe, Tempe, Arizona
Rhonda Robinson, Northern Illinois University, DeKalb, Illinois
Mary Niemczyk, Arizona State University-Polytechnic, Mesa, Arizona
Robert Atkinson, Arizona State University-Tempe, Tempe, Arizona
Tristan E. Johnson, Florida State University, Tallahassee, Florida
Debra L. O'Connor, Intelligent Decision Systems, Inc., Williamsburg, Virginia
Remy M. J. P. Rikers, Erasmus University Rotterdam, Rotterdam, the Netherlands
Paul Ayres, University of New South Wales, Sydney, Australia
Aaron R. Duley, National Aeronautics and Space Administration, Ames Research Center, Moffett Field, California
Paul Ward, Florida State University, Tallahassee, Florida
Peter A. Hancock, University of Central Florida, Orlando, Florida

* Tamara van Gog and Fred Paas were lead authors for this chapter and coordinated the various sections comprising this chapter.

CONTENTS

Introduction
    Assessment of Learning vs. Performance
    Brief Overview of the Chapter Sections
Assessment of Individual Learning Processes
    Rationale for Using Mixed Methods
    Analyzing Learning Using Quantitative Methods and Techniques
    Selecting Tests
    Validity
    Reliability
    Evaluating and Developing Tests and Test Items
    Scores on Numerically Based Rubrics and Checklists
    Measuring Learning Processes in Technology-Mediated Communications
    Using Technology-Based Course Statistics to Examine Learning Processes
    Measuring Attitudes Using Questionnaires That Use Likert-Type Items
    Analyzing Learning Using More Qualitative Methods and Techniques
    Grounded Theory
    Participant Observation
    Nonparticipant Observation
    Issues Related to Conducting Observations
    Interviews
    Document, Artifact, and Online Communications and Activities Analysis
    Methods for Analyzing Qualitative Data
    Writing the Research Report
    Conclusion
Assessment of Group Learning Processes
    Group Learning Processes Compared with Individual Learning Processes and Group Performance
    Methodological Framework: Direct and Indirect Process Measures
    Data Collection and Analysis Techniques
    Direct Process Data Collection and Analysis
    Use of Technology to Capture Group Process
    Use of Observations to Capture Group Process
    Direct Process Data Analysis
    Indirect Process Data Collection and Analysis
    Interviews
    Questionnaires
    Conceptual Methods
    General Considerations for Group Learning Process Assessment
    Group Setting
    Variance in Group Member Participation
    Overall Approach to Data Collection and Analysis
    Thresholds
    Conclusion
Assessment of Complex Performance
    Assessment Tasks
    Assessment Criteria and Standards
    Collecting Performance Data
    Collecting Performance Outcome (Product) Data
    Collecting Performance Process Data
    Data Analysis
    Analysis of Observation, Eye Movement, and Verbal Protocol Data
    Combining Methods and Measures
    Discussion
Setting Up a Laboratory for Measurement of Complex Performances
    Instrumentation and Common Configurations
    Design Patterns for Laboratory Instrumentation
    Stimulus Presentation and Control Model
    Stimulus Presentation and Control Model with External Hardware
    Common Paradigms and Configurations
    Summary of Design Configurations
    General-Purpose Hardware
    Data Acquisition Devices
    Computers as Instrumentation
    Discussion
Concluding Remarks
References

ABSTRACT

The focus of this chapter is on methods of data collection and analysis for the assessment of learning processes and complex performance, the last part of the empirical cycle after theory development and experimental design. In the introduction (van Gog and Paas), the general background and the relation between the chapter sections are briefly described. The section by Savenye, Robinson, Niemczyk, and Atkinson focuses on methods of data collection and analysis for assessment of individual learning processes, whereas the section by Johnson and O'Connor is concerned with methods for assessment of group learning processes. The chapter section by van Gog, Rikers, and Ayres discusses the assessment of complex performance, and the final chapter section by Duley, Ward, Szalma, and Hancock is concerned with setting up laboratories to measure learning and complex performance.

KEYWORDS

Assessment criteria: Describe the aspects of performance that will be assessed.

Assessment of learning: Measuring learning achievement, performance, outcomes, and processes by many means.

Assessment standards: Describe the quality of performance on each of the criteria that can be expected of participants at different stages (e.g., age, grade) based on a participant's past performance (self-referenced), peer group performance (norm-referenced), or an objective standard (criterion-referenced).

Collective data collection: Obtaining data from individual group members; data are later aggregated or manipulated into a representation of the group as a whole.

Complex performance: Refers to real-world activities that require the integration of disparate measurement instrumentation as well as the need for time-critical experimental control.

Direct process measure: Continuous elicitation of data from beginning to end of the (group) process; direct process measures involve videotaping, audiotaping, direct researcher observation, or a combination of these methods.

Group: Two or more individuals working together to achieve a common goal.

Group learning process: Actions and interactions performed by group members during the group learning task.

Holistic data collection: Obtaining data from the group as a whole; as this type of data collection results in a representation of the group rather than of individual group members, it is not necessary to aggregate or manipulate data.

Indirect process measure: Discrete measure at a specific point in time during the (group) process; often involves multiple points of data collection; indirect process measures may measure processes, outcomes, products, or other factors related to group process.

Instrumentation: Hardware devices used to assist with the process of data acquisition and measurement.

Mixed-methods research: Studies that rely on quantitative and qualitative as well as other methods for formulating research questions, collecting and analyzing data, and interpreting findings.

Online/offline measures: Online measures are recorded during task performance; offline measures are recorded after task performance.
Process-tracing techniques: Techniques that record performance process data, such as verbal reports, eye movements, and actions, that can be used to make inferences about the cognitive processes or knowledge underlying task performance.

Qualitative research: Sometimes called naturalistic; research on human systems whose hallmarks include researcher as instrument, natural settings, and little manipulation.

Quantitative research: Often conceived of as more traditional or positivistic; typified by experimental or correlational studies. Data and findings are usually represented through numbers and results of statistical tests.

Task complexity: Can be defined subjectively (individual characteristics, such as expertise or perception), objectively (task characteristics, such as multiple solution paths or goals), or as an interaction (individual and task characteristics).

INTRODUCTION

Tamara van Gog and Fred Paas

The most important rule concerning data collection and analysis is: do not attempt to collect or analyze all possible kinds of data. Unless you are conducting a truly explorative study (which is hardly ever necessary nowadays, considering the abundance of literature on most topics), the first part of the empirical cycle—the process of theory development—should result in clear research questions or hypotheses that will allow you to choose an appropriate design to study these. These hypotheses should also indicate the kind of data you will need to collect—that is, the data you have hypotheses about and some necessary control data (e.g., time on task)—and, together with the design, provide some indications as to how to analyze those data (e.g., 2 × 2 factorial design, 2 × 2 MANCOVA). But these are just indications, and many decisions remain to be made. To name just a few issues regarding data collection (for an elaboration on those questions, see, for example, Christensen, 2006; Sapsford and Jupp, 1996): Which participants (human/nonhuman, age, educational background, gender) and how many to use? What and how many tasks or stimuli to present and on what apparatus? What (control) measures to take? What instructions to give? What procedure to use? When to schedule the sessions? Making those decisions is not an easy task, and unfortunately strict guidelines cannot be given because acceptable answers are highly dependent on the exact nature, background, goals, and context of the study. To give you some direction, it might help to have a look at how these questions have been dealt with in high-quality studies in your domain (which are generally published in peer-reviewed, high-impact journals). Because of the importance and difficulty of finding correct operationalizations of these issues, it is generally advisable to conduct a pilot study to test your data collection and analysis procedures.

In educational research many studies share the common goal of assessing learning or performance, and the sections in this chapter provide information on methods for collecting and analyzing learning and performance data. Even though learning and performance are conceptually different, many of the data collection and analysis techniques can be used to assess both; therefore, we first discuss the differences between the assessment of learning and the assessment of performance before giving a brief overview of the content of the chapter sections.
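Where the hypotheses and design already imply the analysis (e.g., the 2 × 2 factorial design mentioned above), it can help to sketch that analysis before any data are collected. The following is a minimal sketch, assuming a hypothetical 2 × 2 between-subjects dataset and the availability of pandas and statsmodels; the file and variable names (support, pacing, posttest, time_on_task) are illustrative assumptions, not from this chapter.

```python
# Minimal sketch: analyzing a hypothetical 2 x 2 factorial design.
# Assumes a CSV with one row per participant and illustrative columns:
# 'support', 'pacing' (the two factors), 'posttest' (score), 'time_on_task' (control measure).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("experiment_data.csv")  # hypothetical file name

# Cell means give a first impression of main effects and the interaction.
print(data.groupby(["support", "pacing"])["posttest"].agg(["mean", "std", "count"]))

# 2 x 2 ANOVA on the post-test scores.
model = smf.ols("posttest ~ C(support) * C(pacing)", data=data).fit()
print(anova_lm(model, typ=2))

# Adding the control measure (time on task) as a covariate turns this into a 2 x 2 ANCOVA.
ancova = smf.ols("posttest ~ C(support) * C(pacing) + time_on_task", data=data).fit()
print(anova_lm(ancova, typ=2))
```

Planning the analysis at this level of detail also makes explicit which control data (here, time on task) actually need to be collected.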
Assessment of Learning vs. Performance

The definitions of learning and performance have an important similarity, in that they can be used to refer both to an outcome or product and to a process. The term learning is used to refer to the knowledge or skill acquired through instruction or study (note that this dictionary definition ignores the possibility of informal learning, unless this is encompassed by study), as well as to the process of acquiring knowledge or skill through instruction or study. The term performance is used to refer to things accomplished (outcome or product) and to the accomplishment of things (process). Performance implies the use of knowledge rather than merely possessing it. It seems that performance is more closely related to skill than to knowledge acquisition (i.e., learning), but an important difference between the definitions of learning and performance is that performance can be, but is not defined as, a result of instruction or study.

The similarities and differences between these terms have some important implications for educational research. First of all, the fact that both learning and performance can refer to a product and a process enables the use of many different kinds of measures or combinations of measures to assess learning or performance. This can make it quite difficult to compare results of different studies on learning or performance, as they might have assessed different aspects of the same concept and come to very different conclusions. Second, collection and analysis of data about the knowledge an individual possesses can be used to assess their learning but not their performance. That possessing knowledge does not guarantee the ability to use it has been shown in many studies (see, for example, Ericsson and Lehmann, 1996). Nonetheless, for a long time, educational certification practices were based on this assumption: Students received their diplomas after completing a series of courses successfully, and success was usually measured by the amount of knowledge a student possessed. Given that this measure has no one-to-one mapping with successful performance, this practice posed many problems, both for students and employers, when students went to work after their educational trajectory. Hence, in the field of education it is recognized now that knowledge is a necessary but not sufficient condition for performance, and the field is gradually making a shift from a knowledge-based testing culture to a performance-based assessment culture (Birenbaum and Dochy, 1996). Finally, because performance is not defined as a result of instruction or study, it can be assessed in all kinds of situations, and when applied in instructional or study settings it may be assessed before, during, and after instruction or study phases. Note, though, that in that case only the difference between performance assessed before and after instruction or study is indicative of learning. One should be careful not to interpret gains in performance during instruction or study as indicators of learning, as these may be artifacts of instructional methods (Bjork, 1999).
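Because only the pre-to-post difference indicates learning, a simple gain computation makes the point concrete. This is a minimal sketch with invented scores; the normalized gain, g = (post − pre) / (max − pre), is one common but not the only way to express such gains.

```python
# Minimal sketch: pre/post performance differences as an indicator of learning.
# Scores are invented for illustration; 'max_score' is the test maximum.
import numpy as np

pre = np.array([40, 55, 62, 30, 48], dtype=float)
post = np.array([65, 70, 80, 52, 66], dtype=float)
max_score = 100.0

raw_gain = post - pre                           # absolute improvement per learner
normalized_gain = raw_gain / (max_score - pre)  # share of the possible improvement realized

print("mean raw gain:", raw_gain.mean())
print("mean normalized gain:", normalized_gain.mean())
```

Performance measured only during instruction, by contrast, enters no such difference score and therefore says little about learning on its own.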
Brief Overview of the Chapter Sections

The first chapter section, Assessment of Individual Learning Processes by Savenye, Robinson, Niemczyk, and Atkinson, introduces educational technology researchers to the conceptual basis and methods of data collection and analysis for investigating individual learning processes. They discuss the quantitative and qualitative research paradigms and the associated approaches to data collection and analysis. They also point out the benefits of combining quantitative and qualitative approaches by conducting mixed-methods studies.

The second chapter section, Assessment of Group Learning Processes by Johnson and O'Connor, focuses on the study of group learning processes, which is more complex than the study of individual learning processes. They discuss several issues that need to be considered prior to setting up a study of group learning processes, such as holistic vs. collective data collection, direct vs. indirect methods of data collection, aggregation or manipulation of individual data into group-level data, and special considerations for setting up a study of group learning processes.

The third chapter section, Assessment of Complex Performance by van Gog, Rikers, and Ayres, discusses data collection and analysis methods for assessment of complex performance. In line with the two-edged definition of performance as a thing accomplished or accomplishing a thing, they distinguish product and process measures and subdivide the process measures further into online (while working on a task) vs. offline (after task completion) measures. They also discuss the opportunities to combine several different measures and the benefits of doing so.

The fourth and final chapter section, Setting Up a Laboratory for Measurement of Complex Performances by Duley, Ward, Szalma, and Hancock, provides insight into the technical setup of laboratories for the assessment of learning processes and complex performance. Rather than providing a list of available hardware, software, and instruments, they have chosen to take the more sensible approach of familiarizing the reader with setting up configurations for stimulus presentation, control options, and response recording, which are relevant for many laboratory studies.
ASSESSMENT OF INDIVIDUAL LEARNING PROCESSES

Wilhelmina Savenye, Rhonda Robinson, Mary Niemczyk, and Robert Atkinson

It is the goal of this section to introduce educational technology researchers to the conceptual basis and methods of data collection and analysis for investigating individual learning processes, including both quantitative and qualitative research techniques. Learning processes, of course, may involve both individual and group efforts of learners in the form of strategies and activities designed to facilitate their learning. Though this section focuses on individual processes and performances, using a variety of methods, these may be adapted for group use (see the chapter section by Johnson and O'Connor). Several assumptions guide this work. Although methods can be suggested here, the researcher must be responsible for understanding the foundational ideas of any study. He or she will want to conduct the study with the utmost attention to quality and therefore will want to turn to specific and detailed texts to learn more deeply how to apply research methods. This section will point the researcher to such references and resources. The objectives of this section are listed below. It is hoped that after reading this chapter, educational technology researchers will be able to:

• Describe methods and techniques for conducting research on individual learning, and compare qualitative and quantitative methods.
• Describe common problems in conducting and evaluating quantitative and qualitative research methods to examine learning processes.
• Consider issues that contribute to the quality of studies using mixed methods.

Rationale for Using Mixed Methods

The terms quantitative and qualitative are commonly used to describe contrasting research approaches. Typically, quantitative research is considered to be more numbers driven, positivistic, and traditional (Borg and Gall, 1989), while qualitative research is often used interchangeably with terms such as naturalistic, ethnographic (Goetz and LeCompte, 1984), subjective, or post-positivistic. We define qualitative research in this section as research that is devoted to developing an understanding of human systems, be they small, such as a technology-using teacher and his or her students and classroom, or large, such as a cultural system. Quantitative and qualitative methods for data collection derive in some measure from a difference in the way one sees the world, which results in what some consider a paradigm debate; however, in assessing learning processes, both approaches to data collection have importance, and using elements from both approaches can be very helpful. Driscoll (1995) suggested that educational technologists select research paradigms based on what they perceive to be the most critical questions. Robinson (1995) and Reigeluth (1989) concurred, noting the considerable debate within the field regarding suitable research questions and methods.

Learning processes are complex and individual. Lowyck and Elen (2004) argued that learning processes are active, constructive, self-regulated, goal oriented, and contextualized. In addition, digital technologies are changing the nature of knowledge and of teaching and learning (Cornu, 2004). It is clear then that the methods required to collect and analyze how learning processes work, when they work, and why they work can be drawn from a mixed-method approach. Thus, researchers can investigate carefully and creatively any questions they choose and derive valid data to help understand learning processes using a combination of methods from both perspectives.

Although not the main focus of this chapter, it is assumed that researchers will submit all procedures, protocols, instruments, and participation forms to the appropriate human-subjects or ethics review unit within their organizations. In any case, researchers should be specific about how they define the assumptions of the study and why what was done was done—in short, they should be able to enter into the current and upcoming discussions as thoughtful, critical, and creative researchers.

Analyzing Learning Using Quantitative Methods and Techniques

Learning achievement or performance in educational technology research is often the primary outcome measure or dependent variable of concern to the researcher. Learning is often therefore studied using more quantitative measures, including what researchers may call tests, assessments, examinations, or quizzes. These measures may be administered in paper-and-pencil form or may be technology based.
If they are technology based, they may be administered at a testing center, with tutors or proctors, or completed by students on their own. In either format, they may be scored by an instructor or tutor or may be automatically scored by the computer (Savenye, 2004a,b). Issues of concern in selecting and developing tests and test items also are relevant when learning is measured en route as performance on practice items. Completion time, often in conjunction with testing, is another learning process variable that can efficiently be examined using quantitative methods. Learning achievement on complex tasks may also be measured more quantitatively using numerically based rubrics and checklists to evaluate products and performances or to evaluate essays or learner-created portfolios. Rubrics and checklists are also often used to derive quantitative data for measuring learning in online discussions or to build frequencies of behaviors from observations of learning processes, often used in conjunction with more qualitative methods (discussed later in this section). Many computer-based course management systems now routinely collect course statistics that may be examined to determine how learners proceed through instruction and what choices they make as they go. Self-evaluations and other aspects of learning, such as learner attitudes, are more commonly measured using questionnaires. Selected types of quantitative methods for examining learning are discussed in turn:

• Tests, examinations, quizzes (administered via paper or technology, including self-evaluations)
• Rubrics or checklists to measure learner performance
• Measuring learning processes in technology-mediated communications
• Technology-based course statistics
• Attitude measures such as questionnaires using Likert-type items

Selecting Tests

Educational researchers frequently select existing tests to assess how individual learning processes are impacted by a novel educational intervention. During this process, the researchers must be conversant with a number of important concepts, including validity and reliability. In the following sections, these concepts are described in greater detail with a specific focus on what researchers need to know when selecting tests.

Validity

Arguably, the most critical aspect of a test is its quality or validity. Simply put, a test is considered valid if it measures what it was created to measure (Borg and Gall, 1989). A test is generally considered valid if the scores it produces help individuals administering the test make accurate inferences about a particular characteristic, trait, or attribute intrinsic to the test taker. As an example, researchers exploring the relative impact of several learning environments would consider a test valid to the extent to which it helps them make an accurate determination of the relative quality and quantity of learning displayed by the students exposed to the respective learning environments. Validity is not a unitary concept; in fact, test developers use several widely accepted procedures to document the level of validity of their test, including content-related, criterion-related, and construct-related validity. Content-related validity represents the extent to which the content of a test is a representative sample of the total subject matter content provided in the learning environment.
Another type of validity is criterion-related validity, which depicts how closely scores on a given test correspond to or predict performance on a criterion measure that exists outside the test. Unlike content validity, this type of validity yields a numeric value: the correlation coefficient, reported on a scale of –1 (perfect, negative relationship) to +1 (perfect, positive relationship). The third type of validity is construct-related validity, which refers to the extent to which the scores on a test correspond with a particular construct or hypothetical concept originating from a theory. Also worth mentioning is a relatively unsophisticated type of validity known as face validity, which is based on the outward appearance of the test. Although this is considered a rather rudimentary approach to establishing validity, it is considered important because of its potential impact on the test taker's motivation. In particular, respondents may be reluctant to complete a test without any apparent face validity.

Reliability

Another important concept involved in test selection is reliability. Simply put, reliability refers to the consistency with which a test yields the same results for a respondent across repeated administrations (Borg and Gall, 1989). Assuming that the focus of the test—a particular attribute or characteristic—remains unchanged between test administrations for a given individual, reliability sheds light on the following question: Does the test always yield the same score for an individual when it is administered on several occasions?

Determining and Judging Reliability

The three basic approaches to determining the reliability of a test are test–retest, alternate forms, and internal consistency (Borg and Gall, 1989; Morris et al., 1987). Perhaps the simplest technique for estimating reliability is the test–retest method. With this approach, a test developer simply administers the test twice to the same group of respondents and then calculates the correlation between the two sets of scores. As a general rule, researchers select tests displaying the highest reliability coefficient because values approaching +1.00 are indicative of a strong relationship between the two sets of respondents' scores; that is, the respondents' relative performance has remained similar across the two testing occasions. Specifically, values above .80 are preferable (Chase, 1999). Another approach to determining reliability is the alternate forms method, in which two equivalent forms of a test are administered to a group of respondents on two separate occasions and the resulting scores correlated. As with the test–retest method, the higher the reliability coefficient, the more confidence a test administrator can place in the ability of a test to consistently measure what it was designed to measure. The final method for estimating the reliability of a test is referred to as internal consistency. Unlike the two previous methods, it does not rely on testing the same group of respondents twice. Instead, the reliability of a test is estimated based on a single test administration, which can be accomplished in two ways—either using the split-halves method or using one of the Kuder–Richardson methods, which do not require splitting a test in half.

Limits of Reliability

A number of caveats are associated with reliability.
First, it is important to recognize that high reliability does not guarantee validity; in other words, a test can consistently measure what it was intended to measure while still lacking validity. Knowing that a test is reliable does not permit someone to make judgments about its validity. Reliability is, however, necessary for validity, as it impacts the accuracy with which one can draw inferences about a particular characteristic or attribute intrinsic to the test taker. Reliability is also impacted by several factors. Test length is the first: all things being equal, shorter tests tend to be less reliable than longer tests, because the latter afford the test developer more opportunities to accurately measure the trait or characteristic under examination. The reliability of a test is also impacted by the format of its items. A general heuristic to remember is that tests constructed with select-type items tend to be more reliable than tests with supply-type or other subjectively scored items.

Evaluating and Developing Tests and Test Items

The construction of learning assessments is one of the most important responsibilities of instructors and researchers. Tests should be composed of items that represent important and clearly stated objectives and that adequately sample subject matter from all of the learning objectives. The most effective way to ensure adequate representation of items across content, cognitive processes, and objectives is to develop a test blueprint or table of specifications (Sax, 1980). Multiple types of performance measures allow students an opportunity to demonstrate their particular skills in defined areas and to receive varied feedback on their performances; this is particularly important in self-instructional settings, such as online courses (Savenye, 2004a,b). Multiple learning measures in online settings also offer security advantages (Ko and Rossen, 2001). Tests should also give students the opportunity to respond to different types of item formats that assess different levels of cognition, such as comprehension, application, analysis, and synthesis (Popham, 1991). Different approaches and formats can yield different diagnostic information to instructors, as well; for example, well-developed multiple-choice items contain alternatives that represent common student misconceptions or errors. Short-answer item responses can give the instructor information about the student's thinking underlying the answer (Mehrens et al., 1998).

Because the test item is the essential building block of any test, it is critical to determine the validity of the test item before determining the validity of the test itself. Commercial test publishers typically conduct pilot studies (called item tryouts) to get empirical evidence concerning item quality. For these tryouts, several forms of the test are prepared with different subsets of items, so each item appears with every other item. Each form may be given to several hundred examinees. Item analysis data are then calculated, followed by assessment of the performance characteristics of the items, such as item difficulty and item discrimination (i.e., how well the item separates, or discriminates, between those who do well on the test and those who do poorly). The developers discard items that fail to display proper statistical properties (Downing and Haladyna, 1997; Nitko, 2001; Thorndike, 1997).
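The item and reliability statistics described above are straightforward to compute from a matrix of scored responses. Below is a minimal sketch assuming dichotomously scored items (1 = correct, 0 = incorrect) and a small invented dataset; it computes item difficulty, a simple discrimination index (item–rest correlation), and KR-20 as an internal-consistency estimate. It is an illustration of the general technique, not of any particular publisher's procedure.

```python
# Minimal sketch: classical item analysis for dichotomously scored test items.
# 'responses' is an invented examinee-by-item matrix (1 = correct, 0 = incorrect).
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
], dtype=float)

n_examinees, n_items = responses.shape
total = responses.sum(axis=1)

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the rest of the test
# (total score minus that item), so the item is not correlated with itself.
discrimination = np.array([
    np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
    for i in range(n_items)
])

# KR-20 internal consistency: (k / (k - 1)) * (1 - sum(p * q) / variance of total scores).
p = difficulty
q = 1 - p
kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / total.var(ddof=1))

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
print("KR-20:", round(kr20, 2))
```

A test–retest or alternate-forms estimate would instead correlate two administrations (e.g., with np.corrcoef on the two score vectors); the single-administration KR-20 shown here corresponds to the internal-consistency approach.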
Scores on Numerically Based Rubrics and Checklists

Assessing performance can be done by utilizing numerically based rubrics and checklists. Typically, two aspects of a learner's performance can be assessed: the product the learner produces and the process a learner uses to complete the product. Either or both of these elements may be evaluated. Because performance tasks are usually complex, each task provides an opportunity to assess students on several learning goals (Nitko, 2001). Performance criteria are the specific behaviors a student should perform to properly carry out a performance or produce a product. The key to identifying performance criteria is to break down the overall performance or product into its component parts. It is important that performance criteria be specific, observable, and clearly stated (Airasian, 1996).

Scoring rubrics are brief, written descriptions of different levels of performance. They can be used to summarize both performances and products. Scoring rubrics summarize performance in a general way, whereas checklists and rating scales can provide specific diagnostic information about student strengths and weaknesses (Airasian, 1996). Checklists usually contain lists of behaviors, traits, or characteristics that are either present or absent, to be checked off by an observer (Sax, 1980). Although they are similar to checklists, rating scales allow the observer to judge performance along a continuum rather than as a dichotomy (Airasian, 1996).

Measuring Learning Processes in Technology-Mediated Communications

Tiffin and Rajasingham (1995) suggested that education is based on communication. Online technologies, therefore, provide tremendous opportunities for learning and allow us to measure learning in new ways; for example, interactions in online discussions within Internet-based courses may be used to assess students' learning processes. Paulsen (2003) delineated many types of learning activities, including online interviews, online interest groups, role plays, brainstorming, and project groups. These activities, generally involving digital records, will also yield online communication data for research purposes, provided the appropriate ethics and subject guidelines have been followed. The postings learners create and share may also be evaluated using the types of rubrics and checklists discussed earlier. These are of particular value to learners when they receive the assessment tools early in the course and use them to self-evaluate or to conduct peer evaluations to improve the quality of their work (Savenye, 2006, 2007). Goodyear (2000) reminded us that digital technologies add to the research and development enterprise the capability for multimedia communications.

Another aspect of online discussions of value to researchers is that the types of postings students make and the ideas they discuss can be quantified to illuminate students' learning processes. Chen (2005), in an online course activity conducted with groups of six students who did not know each other, found that learners under a less-structured forum condition posted many more socially oriented postings, although their performance on the task was not less than that of the students who did not post as many social postings. She also found that the more interactions a group made, the more positive students' attitudes were toward the course.

Using Technology-Based Course Statistics to Examine Learning Processes

In addition to recording learners' performance on quizzes, tests, and other assignments, most online course management systems automatically collect numerous types of data, which may be used to investigate learning processes. Such data may include information about exactly which components of the course a learner has completed, on which days, and for how much time. Compilations of these data can indicate patterns of use of course components and features (Savenye, 2004a).
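Course-statistics exports differ across systems, so the sketch below assumes a generic, hypothetical activity log in CSV form with columns learner_id, component, date, and minutes; the file name and column names are illustrative only. It compiles the kinds of patterns described above: time per component, completion, and overall component use.

```python
# Minimal sketch: summarizing a hypothetical course-management-system activity log.
# Assumes a CSV export with columns: learner_id, component, date, minutes (names are assumptions).
import pandas as pd

log = pd.read_csv("course_activity_log.csv", parse_dates=["date"])

# Total time each learner spent on each course component.
time_per_component = (
    log.groupby(["learner_id", "component"])["minutes"].sum().unstack(fill_value=0)
)

# How many distinct components each learner opened, and on how many distinct days.
components_used = log.groupby("learner_id")["component"].nunique()
active_days = log.groupby("learner_id")["date"].nunique()

# Overall pattern of use: which components draw the most time across all learners.
component_popularity = log.groupby("component")["minutes"].sum().sort_values(ascending=False)

print(time_per_component)
print(components_used)
print(active_days)
print(component_popularity)
```

Summaries of this kind can then be related to the quantitative learning measures discussed earlier or examined alongside qualitative observations of how learners actually worked through the course.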
Measuring Attitudes Using Questionnaires That Use Likert-Type Items

Several techniques have been used to assess attitudes and feelings of learners in research studies and as part of instruction. Of these methods, Likert-type scales are the most common. Typically, respondents are asked to indicate their strength of feeling toward a series of statements, often in terms of the degree to which they agree or disagree with the position being described. Previous research has found that responding to a Likert-type item is an easier task and provides more information than ranking and paired comparisons. The advantage of a Likert-type item scale is that an absolute level of an individual's responses can be obtained to determine the strength of the attitude (O'Neal and Chissom, 1993).

Thorndike (1997) suggested several factors to consider in developing a Likert-type scale, including the number of steps, odd or even number of steps, and types of anchors. The number of steps in the scale is important as it relates to reliability—the more steps, the greater the reliability. The increase is noticeable up to about seven steps; after this, the reliability begins to diminish, as it becomes difficult to develop meaningful anchors. Five-point scales tend to be the most common. Increasing the number of items can also increase reliability. Although there is considerable debate about this, many researchers hold that better results can be obtained by using an odd number of steps, which provides for a neutral response. The anchors used should fit the meaning of the statements and the goal of the measurement. Common examples include continua such as agree–disagree, effective–ineffective, important–unimportant, and like me–not like me.
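Scoring such a questionnaire typically means reverse-coding negatively worded items, averaging items into a scale score, and checking internal consistency. The sketch below is a minimal illustration with an invented 5-point, four-item scale; the item data and the choice of Cronbach's alpha as the consistency estimate are assumptions for illustration, not prescriptions from this chapter.

```python
# Minimal sketch: scoring a hypothetical 5-point Likert-type attitude scale.
import numpy as np

# Rows = respondents, columns = items q1..q4 (invented data, 1 = strongly disagree ... 5 = strongly agree).
responses = np.array([
    [4, 5, 2, 4],
    [3, 4, 3, 3],
    [5, 5, 1, 5],
    [2, 3, 4, 2],
    [4, 4, 2, 5],
], dtype=float)

# Suppose item q3 (index 2) is negatively worded: reverse-code it on a 1-5 scale.
responses[:, 2] = 6 - responses[:, 2]

# Scale score per respondent: mean of the item responses.
scale_scores = responses.mean(axis=1)

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of the total score).
k = responses.shape[1]
item_vars = responses.var(axis=0, ddof=1)
total_var = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print("scale scores:", np.round(scale_scores, 2))
print("Cronbach's alpha:", round(alpha, 2))
```

For dichotomous items, the KR-20 computation shown earlier is the corresponding special case.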
Analyzing Learning Using More Qualitative Methods and Techniques

Although learning outcomes and processes can be productively examined using the quantitative methods discussed earlier, in a mixed-methods approach many qualitative methods are used to build a deeper understanding of what, why, and how learners learn. With the increasing use of interactive and distance technologies in education and industry, opportunities and at times the responsibility to explore new questions about the processes of learning and instruction have evolved. New technologies also enable researchers to study learners and learning processes in new ways and to expand our views of what we should investigate and how; for example, a qualitative view of how instructors and their students learn through a new technology may yield a view of what is really happening when the technology is used.

As in any research project, the actual research questions guide the selection of appropriate methods of data collection. Once a question or issue has been selected, the choice of qualitative methods falls roughly into the categories of observations, interviews, and document and artifact analysis, although others have conceptualized the methods somewhat differently (Bogdan and Biklen, 1992; Goetz and LeCompte, 1984; Lincoln and Guba, 1985). Qualitative researchers have basically agreed that the human investigator is the primary research instrument (Pelto and Pelto, 1978). In this section, we begin with one approach to conducting qualitative research: grounded theory. We then discuss specific methods that may be called observations, interviews, and document and artifact analysis. As in all qualitative research, it is also assumed that educational technology researchers will use and refine methods with the view that these methods vary in their degree of interactiveness with participants. The following qualitative methods, along with several research perspectives, are examined next:

• Grounded theory
• Participant observations
• Nonparticipant observations
• Interviews, including group and individual
• Document, artifact, and online communications and activities analysis

Grounded Theory

In their overview of grounded theory, Strauss and Corbin (1994, p. 273) noted that it is "a general methodology for developing theory that is grounded in data systematically gathered and analyzed," adding that it is sometimes referred to as the constant comparative method and that it is applicable as well to quantitative research. In grounded theory, the data may come from observations, interviews, and video or document analysis, and, as in other qualitative research, these data may be considered strictly qualitative or may be quantitative. The purpose of the methodology is to develop theory, through an iterative process of data analysis and theoretical analysis, with verification of hypotheses ongoing throughout the study. The researcher begins a study without completely preconceived notions about what the research questions should be and collects and analyzes extensive data with an open mind. As the study progresses, he or she continually examines the data for patterns, and the patterns lead the researcher to build the theory. The researcher continues collecting and examining data until the patterns continue to repeat and few new patterns emerge. The researcher builds the theory from the data, and the theory is thus built on, or grounded in, the phenomena.

Participant Observation

In participant observation, the observer becomes part of the environment, or the cultural context. The hallmark of participant observation is continual interaction between the researcher and the participants; for example, the study may involve periodic interviews interspersed with observations so the researcher can question the participants and verify perceptions and patterns. Results of these interviews may then determine what will initially be recorded during observations. Later, after patterns begin to appear in the observational data, the researcher may conduct interviews asking the participants about these patterns and why they think they are occurring.

As the researcher cannot observe and record everything, in most educational research studies the investigator determines ahead of time what will be observed and recorded, guided but not limited by the research questions. Participant observation is often successfully used to describe what is happening in a context and why it happens. These are questions that cannot be answered in the standard experiment. Many researchers have utilized participant observation methods to examine learning processes.
Robinson (1994) observed classes using Channel One in a Midwestern middle school; she focused her observations on the use of the televised news show and the reaction to it from students, teachers, administrators, and parents. Reilly (1994) analyzed video recordings of both the researcher and students in a project that involved defining a new type of literacy that combined print, video, and computer technologies. Higgins and Rice (1991) investigated teachers' perceptions of testing. They used triangulation and a variety of methods to collect data; however, a key feature of the study was participant observation. Researchers observed 6 teachers for a sample of 10 hours each and recorded instances of classroom behaviors that could be classified as assessment. Similarly, Moallem (1994) used multiple methods to build an experienced teacher's model of teaching and thinking by conducting a series of observations and interviews over a 7-month period.

Nonparticipant Observation

Nonparticipant observation is one of several methods for collecting data considered to be relatively unobtrusive. Many recent authors cite the early work of Webb et al. (1966) as laying the groundwork for use of all types of unobtrusive measures. Several types of nonparticipant observation have been identified by Goetz and LeCompte (1984). These include stream-of-behavior chronicles recorded in written narratives or using video or audio recordings, proxemics and kinesics (i.e., the study of uses of social space and movement), and interaction analysis protocols, typically in the form of observations of particular types of behaviors that are categorized and coded for analysis of patterns. In nonparticipant observation, observers do not interact to a great degree with those they are observing. The researchers primarily observe and record, using observational forms developed for the study or in the form of extensive field notes; they have no specific roles as participants.

Examples of studies in which observations were conducted that could be considered relatively nonparticipant observation include Savenye and Strand (1989) in the initial pilot test and Savenye (1989) in the subsequent larger field test of a multimedia-based science curriculum. Of most concern during implementation was how teachers used the curriculum. A careful sample of classroom lessons was recorded using video, and the data were coded; for example, teacher questions were coded, and the results indicated that teachers typically used the system pauses to ask recall-level rather than higher-level questions. Analysis of the coded behaviors for what teachers added indicated that most of the teachers in the sample added examples to the lessons that would provide relevance for their own learners. Of particular value to the developers was the finding that teachers had a great degree of freedom in using the curriculum and the students' learning achievement was still high.

In a mixed-methods study, nonparticipant observations may be used along with more quantitative methods to answer focused research questions about what learners do while learning. In a mixed-methods study investigating the effects and use of multimedia learning materials, the researchers collected learning outcome data using periodic tests. They also observed learners as they worked together.
These observations were video recorded and the records analyzed to examine many learning processes, including students' level of cognitive processing, exploratory talk, and collaborative processing (Olkinuora et al., 2004). Researchers may also be interested in using observations to study what types of choices learners make while they proceed through a lesson. Klein and colleagues, for instance, developed an observational instrument used to examine cooperative learning behaviors in technology-based lessons (Crooks et al., 1995; Jones et al., 1995; Klein and Pridemore, 1994).

A variation on nonparticipant observations represents a blend with trace-behavior, artifact, or document analysis. This technique, known as read-think-aloud protocols, asks learners to describe what they do and why they do it (i.e., their thoughts about their processes) as they proceed through an activity, such as a lesson. Smith and Wedman (1988) used this technique to analyze learner tracking and choices. Techniques for coding are described by Spradley (1980); however, protocol analysis (Ericsson and Simon, 1984) techniques could be used on the resulting verbal data.

Issues Related to Conducting Observations

Savenye and Robinson (2004, 2005) have suggested several issues that are critical to using observations to study learning. These issues include those related to scope, biases and the observer's role, sampling, and the use of multiple observers. They caution that a researcher can become lost in the multitudes of observational data that can be collected, both in person and when using audio or video. They recommend limiting the scope of the study specifically to answering the questions at hand. Observers must be careful not to influence the results of the study; that is, they must not make things happen that they want to happen. Potential bias may be handled by simply describing the researcher's role in the research report, but investigators will want to examine periodically what their role is and what type of influences may result from it. In observational research, sampling becomes not random but purposive (Borg and Gall, 1989). For the study to be valid, the reader should be able to believe that a representative sample of involved individuals was observed. The multiple realities of any cultural context should be represented. If several observers will be used to collect the data, and their data will be compared or aggregated, problems with reliability of data may occur. Observers tend to see and subsequently interpret the same phenomena in many different ways. It becomes necessary to train the observers and to ensure that observers are recording the same phenomena in the same ways. When multiple observers are used and behaviors counted or categorized and tallied, it is desirable to calculate and report inter-rater reliability.
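One common inter-rater reliability statistic for two observers assigning categorical codes is Cohen's kappa, which corrects simple percent agreement for chance agreement. The sketch below is a minimal illustration with invented codes from two hypothetical observers; for more than two observers or ordered categories, other coefficients (e.g., Fleiss' kappa or Krippendorff's alpha) would be needed.

```python
# Minimal sketch: Cohen's kappa for two observers coding the same observed events.
from collections import Counter

# Invented parallel codings of 12 classroom events by two hypothetical observers.
rater_a = ["question", "feedback", "question", "lecture", "question", "feedback",
           "lecture", "lecture", "question", "feedback", "question", "lecture"]
rater_b = ["question", "feedback", "lecture", "lecture", "question", "question",
           "lecture", "lecture", "question", "feedback", "question", "lecture"]

n = len(rater_a)

# Observed agreement: proportion of events given the same code by both observers.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: product of each observer's marginal proportions, summed over codes.
counts_a = Counter(rater_a)
counts_b = Counter(rater_b)
categories = set(counts_a) | set(counts_b)
p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"percent agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")
```

Reporting both percent agreement and a chance-corrected coefficient gives readers a clearer picture of how consistently the observation categories were applied.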
Interviews

In contrast with the relatively non-interactive, nonparticipant observation methods described earlier, interviews represent a classic qualitative research method that is directly interactive. Interviews may be structured or unstructured and may be conducted in groups or individually. In an information and communication technologies (ICT) study to investigate how ICT can be introduced into the context of a traditional school, Demetriadis et al. (2005) conducted a series of semistructured interviews over 2 years with 15 teachers/mentors who offered technology training to other teachers.

The cornerstone for conducting good interviews is to be sure one truly listens to respondents and records what they say rather than the researcher's perceptions or interpretations. This is a good rule of thumb in qualitative research in general. It is best to maintain the integrity of the raw data and to make liberal use of the respondents' own words, including quotes. Most researchers, as a study progresses, also maintain field notes that contain interpretations of patterns to be refined and investigated on an ongoing basis.

Many old, adapted, and exciting techniques for structured interviewing are evolving. One example of such a use of interviews is in the Higgins and Rice (1991) study mentioned earlier. In this study, teachers sorted the types of assessment they had named previously in interviews into sets of assessments that were most alike; subsequently, multidimensional scaling was used to analyze these data, yielding a picture of how these teachers viewed testing. Another type of structured interview, mentioned by Goetz and LeCompte (1984), is the interview using projective techniques. Photographs, drawings, and other visuals or objects may be used to elicit individuals' opinions or feelings.

Instructional planning and design processes have long been of interest to educational technology researchers; for example, using a case-study approach, Reiser and Mory (1991) employed interviews to examine two teachers' instructional design and planning techniques. One of the models proposed for the design of complex learning is that of van Merriënboer et al. (1992), who developed the four-component model, which subsequently was further developed as the 4C/ID model (van Merriënboer, 1997). Such design models have been effectively studied using mixed methods, including interviews, particularly when those processes relate to complex learning. How expert designers go about complex design tasks has been investigated using both interviews and examination of the designers' products (Kirschner et al., 2002). Problem-based instructional design, blending many aspects of curriculum, instruction, and media options (Dijkstra, 2004), could also be productively studied using interviews.

Interviews to examine learning processes may be conducted individually or in groups. A specialized group interview method is the focus group (Morgan, 1996), which is typically conducted with relatively similar participants using a structured or semi-structured protocol to examine overall patterns in learning behaviors, attitudes, or interests. Suggestions for heightening the quality of interviews include employing careful listening and recording techniques; taking care to ask probing questions when needed; keeping the data in their original form, even after they have been analyzed; being respectful of participants; and debriefing participants after the interviews (Savenye and Robinson, 2005).

Document, Artifact, and Online Communications and Activities Analysis

Beyond nonparticipant observation, many unobtrusive methods exist for collecting information about human behaviors. These fall roughly into the categories of document and artifact analyses but overlap with other methods; for example, verbal or nonverbal behavior streams produced during video observations may be subjected to intense microanalysis to answer an almost unlimited number of research questions. Content analysis, as one example, may be done on these narratives.
In the Moallem (1994), Higgins and Rice (1991), and Reiser and Mory (1991) studies of teachers' planning, thinking, behaviors, and conceptions of testing, documents developed by the teachers, such as instructional plans and actual tests, were collected and analyzed. Goetz and LeCompte (1984) defined artifacts of interest to researchers as things that people make and do. The artifacts of interest to educational technologists are often written, but computer and online trails of behavior are objects of analysis as well. Examples of artifacts that may help to illuminate research questions include textbooks and other instructional materials, such as media materials; memos, letters, and now e-mail records, as well as logs of meetings and activities; demographic information, such as enrollment, attendance, and detailed information about participants; and personal logs participants may keep.

In studies in educational technology, researchers often analyze the patterns of learner pathways, decisions, and choices made as learners proceed through computer-based lessons (Savenye et al., 1996; Shin et al., 1994). Content analysis of prose in any form may also be considered to fall into this artifact-and-document category of qualitative methodology. Lawless (1994) used concept maps developed by students in the Open University to check for student understanding. Entries in students' journals were analyzed by Perez-Prado and Thirunarayanan (2002) to learn about students' perceptions of online and on-ground versions of the same college course. Espey (2000) studied the content of a school district technology plan.

Methods for Analyzing Qualitative Data

One of the major hallmarks of conducting qualitative research is that data are analyzed continually, throughout the study, from conceptualization through the entire data collection phase and into the interpretation and writing phases. In fact, Goetz and LeCompte (1984) described the processes of analyzing and writing together in what they called analysis and interpretation.

Data Reduction

Goetz and LeCompte (1984) described the conceptual basis for reducing and condensing data in an ongoing style as the study progresses. Researchers theorize as the study begins and build and continually test theories based on observed patterns in data. Goetz and LeCompte also described the analytic procedures researchers use to determine what the data mean. These procedures involve looking for patterns, links, and relationships. In contrast to experimental research, the qualitative researcher engages in speculation while looking for meaning in data; this speculation will lead the researcher to make new observations, conduct new interviews, and look more deeply for new patterns in this recursive process.

It is advisable to collect data in their raw, detailed form and then record patterns. This enables the researcher later to analyze the original data in different ways, perhaps to answer deeper questions than originally conceived. It should be noted that virtually all researchers who use an ethnographic approach advocate writing up field notes immediately after leaving the research site each day. If researchers have collected documents from participants, such as logs, journals, diaries, memos, and letters, these can also be analyzed as raw data. Similarly, official documents of an organization can be subjected to analysis.
Collecting data in the form of photographs, films, and videos, whether produced by participants or by the researcher, has a long tradition in anthropology and education. These data, too, can be analyzed for meaning (Bellman and Jules-Rosette, 1977; Bogaart and Ketelaar, 1983; Bogdan and Biklen, 1992; Collier and Collier, 1986; Heider, 1976; Hockings, 1975).

Coding Data

Early in the study, the researcher will begin to scan recorded data and to develop categories of phenomena. These categories are usually called codes. They enable the researcher to manage data by labeling, storing, and retrieving them according to the codes. Miles and Huberman (1994) suggested that data can be coded descriptively or interpretively. Bogdan and Biklen (1992) recommended reading data over at least several times to begin to develop a coding scheme. In one of the many examples he provided, Spradley (1979) described in extensive detail how to code and analyze interview data, which are semantic data, as are most qualitative data. He also described how to construct domain, structural, taxonomic, and componential analyses.

Data Management

Analysis of data requires continually examining, sorting, and reexamining data. Qualitative researchers use many means to organize, retrieve, and analyze their data. To code data, many researchers simply use notebooks and boxes of paper, which can then be resorted and analyzed on an ongoing basis. Computers have long been used for managing and analyzing qualitative data. Several resources exist to aid the researcher in finding and using software for data analysis and management, including books (Weitzman and Miles, 1995) and websites that discuss and evaluate research software (American Evaluation Association, 2007; Cuneo, 2000; Horber, 2006).

Writing the Research Report

In some respects, writing a report of a study that uses mixed methods may not differ greatly from writing a report summarizing a more traditional experimental study; for example, a standard format for preparing a research report includes an introduction, literature review, description of methods, and presentation of findings, completed by a summary and discussion (Borg and Gall, 1989). A mixed-methods study, however, allows the researcher the opportunity to create sections of the report that expand on the traditional. The quantitative findings may be reported in the manner of an experimental study (Ross and Morrison, 2004). The qualitative components of research reports typically will be woven around a theme or central message and will include an introduction, core material, and conclusion (Bogdan and Biklen, 1992).

Qualitative findings may take the form of a series of themes from interview data or the form of a case study, as in the Reiser and Mory (1991) study. For a case study, the report may include considerable quantification and tables of enumerated data, or it may take a strictly narrative form. Recent studies have been reported in more nontraditional forms, such as stories, plays, and poems that show participants' views. Suggestions for writing up qualitative research are many (Meloy, 1994; Van Maanen, 1988; Wolcott, 1990). In addition to the studies mentioned earlier, many excellent examples of mixed-methods studies may be examined to see the various ways in which the results of these studies have been written.
Seel and colleagues (2000), in an investigation of mental models and model-centered learning environments, used quantitative learning measures that included pretests, posttests, and a measure of the stability of learning four months after the instruction. They also used a receptive interview technique they called causal explanations to investigate learners' mental models and learning processes. In this and subsequent studies, Seel (2004) also investigated learners' mental models of dynamic systems using causal diagrams that learners developed, as well as teach-back procedures in which a student explains a model to another student and this epistemic discourse is then examined.

Conclusion

The challenges to educational technology researchers who choose to use multiple methods to answer their questions are many, but the outcome of choosing mixed methods has great potential. Issues of validity, reliability, and generalizability are central to experimental research (Ross and Morrison, 2004) and to mixed-methods research; however, these concerns are addressed quite differently when using qualitative methods and techniques. Suggestions and criteria for evaluating the quality of mixed-methods studies and research activities may be adapted from those suggested by Savenye and Robinson (2005):

• Learn as much as possible about the context of the study, and build in enough time to conduct the study well.
• Learn more about the methods to be used, and train yourself in these methods.
• Conduct pilot studies whenever possible.
• Use triangulation (simply put, use multiple data sources to yield deeper, more accurate views of the findings).
• Be ethical in all ways when conducting research.
• Listen carefully to participants, and carefully record what they say and do.
• Keep good records, including audit trails.
• Analyze data continually throughout the study, and consider having other researchers and participants review your themes, patterns, and findings to verify them.
• Describe well all methods, decisions, assumptions, and biases.
• Use the appropriate methods (and balance of methods when using mixed methods); this is the key to successful educational research.

ASSESSMENT OF GROUP LEARNING PROCESSES

Tristan E. Johnson and Debra L. O'Connor

Similar to organizations that rely on groups of workers to address a variety of difficult and challenging tasks (Salas and Fiore, 2004), groups are formed in various learning settings to meet instructional needs as well as to exploit the pedagogical, learning, and pragmatic benefits associated with group learning (Stahl, 2006). In educational settings, small groups have typically been used to promote participation and enhance learning. One of the main reasons for creating learning groups is to facilitate the development of professional skills that are promoted by group learning, such as communication, teamwork, decision making, leadership, valuing others, problem solving, negotiation, thinking creatively, and working as a member of a group (Carnevale et al., 1989).

Group learning processes are the interactions of two or more individuals with each other and their environment with the intent to change knowledge, skill, or attitude. We use the term group to refer to small groups, not to large groups such as large organizations (Levine and Moreland, 1990; Woods et al., 2000).
Interest in group learning processes can be found not only in traditional educational settings such as elementary and secondary schools but also in workplace settings, including the military, industry, business, and even sports (Guzzo and Shea, 1992; Levine and Moreland, 1990).

There are several reasons to assess group learning processes. These include the need to measure group learning as a process outcome and to capture the learning process in order to provide feedback to the group, with the intent to improve team interactions and thereby improve overall team performance. Studies looking at group processes have led to improved understanding of what groups do and how and why they do it (Salas and Cannon-Bowers, 2000). Another reason to assess group learning processes is to capture highly successful group process behaviors in order to develop an interaction framework that could then inform the design and development of group instructional strategies. Further, because the roles and use of groups in supporting and facilitating learning have increased, interest in studying the underlying group mechanisms has also increased.

Many different types of data collection and analysis methods can be used to assess group learning processes. The purpose of this section is to describe these methods by (1) clarifying how these methods are similar to and different from single-learner methods, (2) describing a framework of data collection and analysis techniques, and (3) presenting analysis considerations specific to studying group learning processes, along with several examples of existing methodologies.

Group Learning Processes Compared with Individual Learning Processes and Group Performance

Traditional research on learning processes has focused on psychological perspectives using traditional psychological methods. The unit of analysis for these methods emphasizes the behavior or mental activity of an individual, concentrating on learning, instructional outcomes, meaning making, or cognition, all at an individual level (Koschmann, 1996; Stahl, 2006). In contrast, the framework for group research draws on the research traditions of multiple disciplines, such as communication, information, sociology, linguistics, military, human factors, and medicine, as well as fields of applied psychology such as instructional, educational, social, industrial, and organizational psychology. As a whole, these disciplines extend the traditional psychological perspectives and seek understanding related to interaction, spoken language, written language, culture, and other aspects of social situations. Stahl (2006) pointed out that individuals often think and learn apart from others, but learning and thinking in isolation are still conditioned and mediated by important social considerations.

Group research considers various social dimensions but primarily focuses on either group performance or group learning. Group learning research is situated in typical learning settings. We often see children and youth engaged in group learning in school settings. Adult group learning is found in post-secondary education, professional schools, vocational schools, colleges and universities, and training sessions, as well as in on-the-job training environments. A number of learning methods have been used in all of these settings.
A few specific strategies that use groups to facilitate learning include cooperative learning, collaborative learning (Johnson et al., 2000; Salomon and Perkins, 1998), computer-supported collaborative learning (Stahl, 2006), and team-based learning (Michaelsen, 2004). General terms used to refer to multiple-person learning activities include learning groups, team learning, and group learning. Often, these general terms and specific strategies are used interchangeably, and sometimes not in the ways just described.

In addition to learning groups, adults engage in group activities in performance (workplace) settings. Although a distinction is made between learning and performance, the processes are similar for groups whose primary intent is to learn and for those focused on performing. The literature on workplace groups (whose primary focus is on completing a task) therefore offers a number of techniques that can be used to study group learning processes, much like the literature on individual learning. Group learning process methods include the various methods typically found when studying individuals as well as additional methodologies unique to studying groups.

Methodological Framework: Direct and Indirect Process Measures

When studying group learning processes, three general categories of measures can be employed: (1) the process or action of a group (direct process measures), (2) a state or a point in time of a group (indirect process measures), and (3) an outcome or performance of a group (indirect non-process measures).

Direct process measures are techniques that directly capture the process of a group. These measures are continuous in nature and capture data across time by recording the sound and sight of the group interactions. Examples include recordings of spoken language, written language, and visible interactions. These recordings can be video or audio recordings, as well as observation notes.

Indirect process measures use techniques that indirectly capture group processes. These measures are discrete and capture a state or condition of the group processes at a particular point in time, either during or after group activity. They involve capturing group member or observer perceptions and reactions that focus on group processes. Examples include interviews, surveys, and questionnaires that focus on explicating the nature of a group learning process at a given point in time. These measures collect group member responses about the process and are specifically not a direct observation of the process.

Indirect non-process measures capture group learning data relating to outcomes, products, or performance. These are not measures of the actual process but are measures that might be related to group processes. They may include group characteristics such as demographics, beliefs, efficacy, preferences, size, background, experience, diversity, and trust (Mathieu et al., 2000). These types of measures have the potential to explicate and support the nature of group learning processes. They focus on collecting products or performance scores as well as soliciting group member responses about group characteristics, and they are not a direct observation of the group learning process. Examples include performance scores, product evaluations, surveys, questionnaires, interview transcripts, and group member knowledge structures.
These measures focus on explicating the nature of a given group's non-process characteristics.

When considering how to assess group learning processes, many of the techniques are very similar or identical to those used to study individuals. Group learning process measures can be collected at both the individual and the group level (O'Neil et al., 2000; Webb, 1982). Because the techniques can be similar or identical for individuals and groups, confusion may arise when it is realized that individual-level data are not in a form that can be analyzed; the data must be group-level data (a group dataset; see Figure 55.1) for analysis.

When designing a study on group learning processes, various measurement techniques can be used depending on the type of questions being asked. Although numerous possibilities are associated with the assessment of group learning processes, three elements must be considered when deciding which techniques to use: data collection, data manipulation, and data analysis.

Figure 55.1 Alternative process measures for assessing group learning processes: direct capture of group-level process data over time yields a holistic group dataset; indirect elicitation of group-level process data at a specific point in time also yields a holistic group dataset; indirect elicitation of individual-level data at a specific point in time yields, after analysis of the individual data, a collective group dataset.

Data collection techniques involve capturing or eliciting data related to group learning processes at an individual or group level. Data collected at the group level (capturing group interactions or eliciting group data) yield holistic group datasets (Figure 55.1). When the collected data are in this format, it is not necessary to manipulate the data; they are ready to be analyzed. If, however, data are collected at the individual level, then the data must be manipulated, typically via aggregation (Stahl, 2006), to create a dataset that represents the group (a collective group dataset) prior to data analysis (Figure 55.1). Collecting data at the individual level involves collecting individual group members' data and then transforming the individual data into an appropriate form (a collective group dataset) for analysis. This technique of creating a collective group dataset from individual data is similar to a process referred to as analysis-constructed dataset creation (O'Connor and Johnson, 2004); a minimal computational sketch of this aggregation step is given below. In this form, the data are ready to be analyzed.

Data Collection and Analysis Techniques

When considering the different group learning process assessment techniques, they can be classified based on the type of measure (continuous or discrete). The corresponding analytical techniques that can be used depend on the collected data. Many techniques have been used to assess groups. The following sections present the major categories of techniques based on their ability to measure group processes directly (continuous measures) or indirectly (discrete measures).
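Before turning to the specific techniques, the following minimal sketch illustrates the aggregation step described above. The data, the variable names, and the use of a simple mean as the aggregation rule are illustrative assumptions only, not a prescription drawn from the methods cited in this section.

    from statistics import mean

    # Hypothetical individual-level data: each group member's rating of a
    # process-related questionnaire item on a 5-point scale.
    individual_data = {
        "group_A": {"member_1": 4, "member_2": 5, "member_3": 4},
        "group_B": {"member_1": 2, "member_2": 3, "member_3": 3, "member_4": 2},
    }

    # Aggregate the individual responses into one value per group (a collective
    # group dataset); the mean is used here purely for illustration, and other
    # rules (median, consensus counts, similarity indices) could be substituted.
    collective_group_dataset = {
        group: mean(ratings.values())
        for group, ratings in individual_data.items()
    }

    print(collective_group_dataset)
    # e.g., {'group_A': 4.33..., 'group_B': 2.5}

Whether such an aggregate is an adequate representation of the group is itself a methodological decision, as discussed under General Considerations later in this section.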
Table 55.1 summarizes the nature of data collection, manipulation, and analysis for the three major groupings of measurement techniques: direct process measures and the two variations of indirect process measures.

TABLE 55.1 Summary of Measurement Techniques Used to Assess Group Learning Processes

Direct Process Measure Techniques: Holistic Group Dataset
Data collection: Directly capturing group learning processes involves techniques that are used by all group members at the same time.
Data manipulation: Not needed, because data are captured at the group level (holistic group dataset).
Data analysis: Continuous process techniques focus on the interactions of group members, generating qualitative and quantitative findings associated with continuous measures.

Indirect Process Measure Techniques: Holistic Group Dataset
Data collection: Indirectly eliciting group learning processes involves techniques that are used by all group members at the same time.
Data manipulation: Not needed, because data are captured at the group level (holistic group dataset).
Data analysis: Discrete process techniques depend on dataset characteristics (focus on process or performance). They can include qualitative and quantitative data analysis techniques associated with discrete measures.

Indirect Process Measure Techniques: Collective Group Dataset
Data collection: Indirectly eliciting group learning processes involves techniques that are used by each group member separately.
Data manipulation: Individual data are aggregated to create a dataset that represents the group (analysis constructed).
Data analysis: Discrete process techniques depend on dataset characteristics (focus on process or performance). They can include qualitative and quantitative data analysis techniques associated with discrete measures.

Direct Process Data Collection and Analysis

Direct process measurement techniques focus specifically on capturing the continuous process interactions in groups (O'Neil et al., 2000). These techniques include measures of auditory and visual interactions. Several data collection and data analysis techniques are related to measuring group learning processes directly. The two key techniques for capturing actions and language are (1) technology and (2) observation. Using technology to capture group processes can provide researchers with data different from observation data, and researchers can combine technology and observation simultaneously to capture group processes (O'Neil et al., 2000; Paterson et al., 2003). These data can be analyzed in the captured form or transcribed into text form.

Use of Technology to Capture Group Process

Spoken Language Processes

Techniques to capture a group's spoken language involve either audio or video recording (Schweiger, 1986; Willis, 2002) of the spoken language that occurs during group interactions (Pavitt, 1998). They can also include the spoken language of group members as they explain their thinking during group processes in the form of a think-aloud protocol (Ericsson and Simon, 1993).

Written Language Processes

Group learning processes are typically thought of as using spoken language, but new communication tools are available that allow groups to communicate and interact using written language. Examples include chat boards, whiteboards (although these are not limited to written language), and discussion boards.
Also, computer-supported collaborative learning (CSCL) systems are computer-based network systems that support group learning interactions (Stahl, 2006).

Visible Processes

Techniques to capture a group's visible interactions include video recording of the behaviors and actions that occur in group interactions (Losada, 1990; Prichard, 2006; Schweiger, 1986; Sy, 2005; Willis et al., 2002).

Use of Observations to Capture Group Process

Although the use of technology may capture data with a high level of realism, some group events can be better captured by humans because of their ability to observe more than what can be captured by technology. Observations ideally are carried out with a set of carefully developed observation protocols to help focus the observers and to teach them how to describe key process events. Observers are a good source for capturing various types of information (Patton, 2001), such as settings, human and social environments, group activities, style and types of language used, nonverbal communication, and events that are out of the ordinary. Observational data, for example, are important for studying group learning processes (Battistich et al., 1993; Lingard, 2002; Sy, 2005; Webb, 1982; Willis et al., 2002). The type of information typically captured includes location, organization, activities, and behaviors (Battistich et al., 1993; Losada, 1990), as well as the frequency and quality of interactions (Battistich, 1993).

Direct Process Data Analysis

Data that are a direct measure of group processes are captured in a holistic format that is ready to be analyzed (Figure 55.1 and Table 55.1). Several analytical techniques are available for analyzing group data, particularly direct process data. The following is a representative sample of the analysis techniques applied to spoken or written language, visible interaction, and observational data: sequential analysis of group interactions (Bowers, 2006; Jeong, 2003; Rourke et al., 2001), analysis of interaction communication (Bales, 1950; Qureshi, 1995), communication analysis (Bowers et al., 1998), anticipation ratio (Eccles and Tenenbaum, 2004), in-process coordination (Eccles and Tenenbaum, 2004), discourse analysis (Aviv, 2003; Hara et al., 2000), content analysis (Aviv, 2003; Hara et al., 2000), cohesion analysis (Aviv, 2003), and protocol analysis (Ericsson and Simon, 1980, 1993). Visible interaction techniques also include behavior time series analysis (Losada, 1990), which involves looking at dominant vs. submissive, friendly vs. unfriendly, or task-oriented vs. emotionally expressive behavior. For observational data, researchers focus on various qualitative techniques associated with naturalistic observations (Adler and Adler, 1994; Patton, 2001). Common tasks associated with this type of analysis include group and character sequence analysis and assertion evaluation (Garfinkel, 1967; Jorgensen, 1989).

Indirect Process Data Collection and Analysis

Many data collection techniques are related to measuring group learning processes indirectly. Indirect group process, characteristic, and product measurement techniques elicit group information at a specific point in time. These discrete measures do not capture group processes directly but elicit data that describe group processes or process-related data, such as group characteristics or group outcomes (things that may be related to the group processes).
The three key types of data related to group learning processes are indirect group process data, group characteristic data, and group product data, within which specific factors can be measured. Indirect group process data describe group processes and can include factors such as group communication (verbal and nonverbal), group actions, group behaviors, group performance, and group processes. Group characteristic data, as they relate to group processes, include factors such as group knowledge, group skills, group efficacy, group attitudes, group member roles, group environment, and group leadership. The key elicitation techniques for both of these types of indirect data include interviews, questionnaires, and conceptual methods (Cooke et al., 2000). Each technique can be focused on group processes or group characteristics. After reviewing methods to analyze group processes, we discuss methods for analyzing group products.

Interviews

Interviews are a good technique for collecting general data about a group. The various types of interviewing techniques include unstructured interviews (Lingard, 2002) and more structured interviews, which are guided by a predetermined format that can be either rigid or loosely constrained. Structured interviews require more time to develop but are more systematic (Cooke et al., 2000). Interviews are typically conducted with a single person at a time; however, it is not uncommon to conduct a focus group, in which the entire group is interviewed simultaneously. In a focus group, a facilitator interviews by leading a free and open group discussion (Myllyaho et al., 2004).

The analysis of interview data requires basic qualitative data analysis techniques (Adler and Adler, 1994; Patton, 2001). Conducting interviews can be straightforward, but the analysis of the data relies heavily on the interviewer's interpretations (Langan-Fox, 2000). Key steps in analyzing interviews are coding the data for themes (Lingard, 2002) and then studying the codes for meaning. Each phrase is closely examined to discover important concepts and reveal overall relationships. For a more holistic approach to analysis, a group interview technique can be used to discuss findings and to generate collective meaning given specific questions (Myllyaho et al., 2004). Content analysis is commonly used to analyze written statements (Langan-Fox and Tan, 1997). Other key analysis techniques focus on process analysis (Fussell et al., 1998; Prichard, 2006), specifically looking at discussion topics, group coordination, group cognitive overload, and analysis of the task process. Other group characteristic analysis techniques include role analysis and power analysis (Aviv, 2003).

Questionnaires

Questionnaires are a commonly used technique to collect data about group processes (O'Neil et al., 2000; Sy, 2005; Webb, 1982; Willis et al., 2002). Similar to highly structured interviews, questionnaires can address both relationship-oriented and task-oriented processes (Urch Druskat and Kayes, 2000). Questionnaires can be either closed ended or open ended (Alavi, 1994). Open-ended questionnaires are more closely related to a structured interview; the data collected using this format can be focused on group processes as well as group characteristics. Closed-ended questionnaires offer a limited set of responses, involving some form of scale that could be nominal, ordinal, interval, or ratio.
Data from this format have a limited ability to capture group process data, but this is the typical format for collecting data associated with group characteristics such as social space, group efficacy, group skills, group attitudes, group member roles, leadership, and group knowledge. Data from questionnaires can be analyzed much like interview data if the items are open ended. If the questionnaire is closed ended, then the instrument must be scrutinized for reliability prior to data analysis. Assuming sufficient evidence of reliability, analyzing data from closed-ended questionnaires involves interpreting a measurement based on a particular theoretical construct. The types of data analysis techniques that are appropriate depend on the type of scale used in the questionnaire (nominal, ordinal, interval, or ratio).

Conceptual Methods

Conceptual methods involve assessing individual or group understanding of a given topic. Several data collection techniques are utilized to elicit knowledge structures. A review of the literature by Langan-Fox et al. (2000) found that knowledge in teams has been investigated with several qualitative and quantitative methods, including various elicitation techniques (e.g., cognitive interviewing, observation, card sorting, causal mapping, pairwise ratings) and representation techniques (e.g., MDS, distance ratio formulas, Pathfinder) that utilize aggregate methods. One of the most common methods for assessing group knowledge is the use of concept maps (Herl et al., 1999; Ifenthaler, 2005; O'Connor and Johnson, 2004; O'Neil et al., 2000). Through concept mapping, the similarity of group mental models can be measured in terms of the proportion of nodes and links shared between one concept map (mental model) and another (Rowe and Cooke, 1995); a minimal computational sketch of this idea is given below.

Several researchers believe that group knowledge and group processes are linked. Research has shown that specific group interactions, such as communication and coordination, mediate the development of group knowledge and thus mediate group performance (Mathieu et al., 2000). Group interactions coupled with shared group knowledge are a predominant force in the construct of group cognition. As teammates interact, they begin to share knowledge, enabling them to interpret cues in similar ways, make compatible decisions, and take proper actions (Klimoski and Mohammed, 1994). Shared group knowledge can help group members explain other members' actions, understand what is occurring with the task, develop accurate expectations about future member actions and task states, and communicate meanings efficiently.

Analyzing knowledge data can certainly involve qualitative methods. These methods tend to offer more detail and depth of information than might be found through statistical analyses (Miles and Huberman, 1994; Patton, 2001). Using qualitative analysis, we obtain greater understanding of the relationships between concepts within the context of the individual mental model. We also gain better insight into the sharedness of understanding between group members. Quantitative data analysis techniques provide researchers with tools to draw inferences about change in group knowledge as well as to test statistically for a change or variation in knowledge structures. Several methods have been developed to analyze data regarding group knowledge. Most of them include an elicitation and an analysis component.
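As a minimal sketch of the shared-nodes-and-links idea mentioned above, the code below represents each concept map as a set of concepts and a set of labeled links and expresses overlap as a single proportion. The data structures and the particular overlap ratio are illustrative assumptions; published similarity indices (e.g., Rowe and Cooke, 1995) differ in their details.

    # Each concept map (mental model) is represented as a set of concepts
    # ("nodes") and a set of labeled links between concepts.
    def map_similarity(map_a, map_b):
        """Proportion of nodes and links shared between two concept maps."""
        shared = len(map_a["nodes"] & map_b["nodes"]) + len(map_a["links"] & map_b["links"])
        total = len(map_a["nodes"] | map_b["nodes"]) + len(map_a["links"] | map_b["links"])
        return shared / total

    # Hypothetical maps elicited from two group members.
    member_1 = {
        "nodes": {"load", "effort", "performance"},
        "links": {("load", "increases", "effort")},
    }
    member_2 = {
        "nodes": {"load", "effort", "practice"},
        "links": {("load", "increases", "effort"), ("practice", "reduces", "load")},
    }

    print(map_similarity(member_1, member_2))  # 0.5

In practice, such pairwise similarity values would be computed for all member pairs within a group and then aggregated or compared across groups, which raises the aggregation and threshold issues discussed elsewhere in this section.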
Some techniques use mixed methods, such as the analysis-constructed shared mental model (ACSMM) technique (O'Connor and Johnson, 2004), DEEP (Spector and Koszalka, 2004), and social network analysis (Qureshi, 1995). Other methods are quantitative in nature, such as SMD (Ifenthaler, 2005), Model Inspection Trace of Concepts and Relations (MITOCAR) (Pirnay-Dummer, 2006), multidimensional scaling (MDS), the distance ratio formula, and Pathfinder (Cooke et al., 2000).

Group product data are the artifacts created from a group interaction. Group products typically do not capture the process that a group undertook to create the product but are evidence of the group's abilities. Many research studies that claim to study group processes capture only group product data. This is due in part to the claim that is made regarding the link between group products and group processes and characteristics (Cooke et al., 2000; Lesh and Dorr, 2003; Mathieu et al., 2000; Salas and Cannon-Bowers, 2001; Schweiger, 1986). Although some evidence suggests this relationship in a few areas, more research is required to substantiate the claim.

Analysis of group product data involves the techniques used when analyzing individual products. Analyzing the quality of a product can be facilitated by the use of specified criteria, which are used to create a product rating scale. Rating scales can include numerical scales, descriptive scales, or checklists. Numerical scales present a range of numbers (usually sequential) that are defined by a label at either end, and each item is rated according to the numerical scale. There is no specific definition of what the various numbers mean, except for the indicators at the ends of the scale; for example, a scale from 1 (very weak) to 5 (very strong) is very subjective but relatively easy to create. Descriptive scales are similar but focus on verbal statements. Numbers can be assigned to each statement, and the statements are typically arranged in a logical order. A common example of a descriptive scale is "strongly disagree (1), disagree (2), neutral (3), agree (4), and strongly agree (5)." A checklist can be developed to delineate specific qualities for a given criterion. Checklists can provide a high level of reliability because a specific quality is presented and the rater simply indicates whether it is present or not. Ensuring the validity of a checklist requires a careful task analysis.

General Considerations for Group Learning Process Assessment

In the assessment of group learning processes, researchers should consider several issues. These issues fall into four categories: (1) group setting, (2) variance in group member participation, (3) overall approach to data collection and analysis, and (4) thresholds.

Group Setting

To a somewhat lesser degree than the other three issues, group settings should be considered when determining the best approach and methods for a particular study. Finalizing which techniques to use may depend on whether the groups will be working in a group learning setting or working individually and then coming together as a group at various points. Some groups may meet face-to-face or in other settings that allow for synchronous interactions; distributed groups, by contrast, may rely on technology-enabled synchronous and asynchronous tools or on asynchronous interactions only. These variations in group setting can influence the selection of specific group learning process assessment methods.
Variance in Group Member Participation

When collecting multiple sets of data over time, researchers should consider how they will deal with variance in group member participation (group members absent during data collection or new members joining the group midstream). There are benefits and consequences to any decision made, but it is necessary to determine whether or not all data collected will be used, regardless of the group members present at the time of data collection. Researchers who choose not to use all data might consider using only those data submitted by group members who were present during all data collection sessions (O'Connor and Johnson, 2004). If data analysis will be based on a consistent number of group members, it will be necessary to consider how to handle data from groups that may not have the same group members present at each measurement point. Also, with fluctuations in group composition, it is important to consider the overall group demographics and the possible influences of individual group members on the group as a whole.

Overall Approach to Data Collection and Analysis

In a holistic approach, individual group members work together and one dataset represents the group as a whole; the processes of group interaction, however, naturally change how individual group members think. The alternative is to capture individual measures and apply some type of aggregation method to represent the group; in that case, researchers should consider whether or not the aggregate is a true representation of the group itself.

Thresholds

When using indirect measures that require aggregation or manipulation of data prior to analysis, researchers will have to consider such issues as similarity scores. These scores define the parameters for deciding whether responses from one individual group member are similar to the responses from other group members (O'Connor and Johnson, 2004; Rentsch and Hall, 1994; Rentsch et al., in press); for example, will a rating of 3.5 on a 5-point scale be considered similar to a rating of 4.0 or to a rating of 3.0? When aggregating individual data into a representation of the group, will the study look only at groups in which a certain percentage of the group responded to the measures (Ancona and Caldwell, 1991)? How will what is similar or shared across individual group members be determined? Will the analysis use counts (x number of group members) or a percentage of the group (e.g., 50%)? What level of similarity or sensitivity will be used to compare across groups (O'Connor and Johnson, 2004): 50%? 75%? What about the level of mean responses in questionnaires (Urch Druskat and Kayes, 2000)? Many of the thresholds that must be considered when assessing group learning processes and analyzing group data are simply not concerns when studying individuals.

Conclusion

Assessment of group learning processes is more complex than assessment of individual learning processes because of the additional considerations necessary for selecting data collection and analysis methods. As in most research, the "very type of experiment set up by researchers will determine the type of data and therefore what can be done with such data in analysis and interpretation" (Langan-Fox et al., 2004, p. 348). Indeed, it is logical to allow the specific research questions to drive the identification of data collection methods.
The selection of research questions and the subsequent identification of data collection methods will naturally place limitations on suitable data analysis methods. Careful planning for the study of group learning processes, from the selection of direct or indirect assessment measures to consideration of the possible influences group characteristics may have on group learning processes, is essential. Because of the many possible combinations of methods and techniques available for studying group learning processes, some feel that research has not yet done enough to determine the best methods for studying groups (Langan-Fox et al., 2000). Many group learning process studies consider only outcome measures and do not directly study group learning processes (Worchel et al., 1992). Others look only at portions of the group process or attempt to assess group learning processes through a comparison of discrete measures of the group over time. Still other methods for the collection and analysis of group data are currently being developed (Seel, 1999). No one best method has been identified for analyzing group learning process data, so we suggest that studies consider utilizing multiple methods to obtain a more comprehensive picture of group learning processes. If we are to better understand the notion of group learning processes and utilize that understanding in the design, implementation, and management of learning groups in the future, then we must address the basic issues related to conceptualization and measurement (Langan-Fox et al., 2004).

ASSESSMENT OF COMPLEX PERFORMANCE

Tamara van Gog, Remy M. J. P. Rikers, and Paul Ayres

This chapter section discusses assessment of complex performance from an educational research perspective, in terms of data collection and analysis. It begins with a short introduction on complex performance and a discussion of the various issues related to selecting and defining appropriate assessment tasks, criteria, and standards that give meaning to the assessment. Although many of the issues discussed here are also important for performance assessment in educational practice, readers particularly interested in that topic might want to refer, for example, to Chapter 44 in this Handbook or to the edited books by Birenbaum and Dochy (1996) and Segers et al. (2003). For a discussion of laboratory setups for data collection, see the section by Duley et al. in this chapter.

Complex performance can be defined as performance on complex tasks; however, definitions of task complexity differ. Campbell (1988), in a review of the literature, categorized complexity as primarily subjective (psychological), as objective (a function of objective task characteristics), or as an interaction between objective and subjective (individual) characteristics. Campbell reported that the subjective perspective emphasized psychological dimensions such as task significance and identity. Objective definitions, on the other hand, consider the degree of structuredness of a task or the possibility of multiple solution paths (Byström and Järvelin, 1995; Campbell, 1988). When the process of task performance can be described in detail a priori (very structured), a task is considered less complex; in contrast, when there is a great deal of uncertainty, it is considered highly complex. Similarly, complexity can vary according to the number of possible solution paths.
When there is just one correct solution path, a task is considered less complex than when multiple paths can lead to a correct solution or when multiple solutions are possible. For the interaction category, Campbell (1988) argued that both the problem solver and the task are important. By defining task complexity in terms of cognitive load (Chandler and Sweller, 1991; Sweller, 1988; Sweller et al., 1998), an example of this interaction can readily be shown. From a cognitive load perspective, complexity is defined by the number of interacting information elements a task contains, which have to be handled simultaneously in working memory. As such, complexity is influenced by expertise (i.e., a subjective, individual characteristic); what may be a complex task for a novice may be a simple task for an expert, because a number of elements have been combined into a cognitive schema that can be handled as a single element in the expert's working memory. Tasks that are highly complex according to the objective definition (i.e., lacking structuredness and having multiple possible solution paths) will also be complex under the interaction definition; however, according to the latter definition, even tasks with a high degree of structuredness or one correct solution path can be considered complex, given a high number of interacting information elements or low performer expertise.

In this chapter section, we limit our discussion to methods of assessment of complex performance on cognitive tasks. It is important to note throughout this discussion that the methods described here can be used to assess (improvements in) complex performance both during and after training, depending on the research questions one seeks to address. After training, performance assessment usually has the goal of assessing learning, which is a goal of many studies in education and instructional design. If one seeks to assess learning, one must be careful not to conclude that participants have learned merely because their performance improved during training. As Bjork (1999) points out, depending on the training conditions, high performance gains during training may not be associated with learning, whereas low performance gains may be. It is important, therefore, to assess learning on retention or transfer tasks instead of on practice tasks. Selection of appropriate assessment tasks is an important issue, which is addressed in the next section.

Assessment Tasks

An essential step in the assessment of performance is the identification of a collection of representative tasks that capture those aspects of the participant's knowledge and skills that a study seeks to address (Ericsson, 2002). Important factors for the representativeness of the collection of assessment tasks are the authenticity, number, and duration of the tasks, all of which are highly influenced by the characteristics of the domain of study. Selecting tasks that adequately capture performance often turns out to be very difficult. Selecting atypical or artificial tasks may even impede learners in demonstrating their true level of understanding. Traditional means of evaluating learners' knowledge or skills have been criticized because they often fail to demonstrate that the learner can actually do something in real life or in their future workplace with the knowledge and skills they have acquired during training (see, for example, Anderson et al., 1996; Shepard, 2000; Thompson, 2001).
The argument for the use of authentic tasks to assess learners' understanding has a long tradition. It started in the days of John Dewey (1916) and continues to the present day (Merrill, 2002; van Merriënboer, 1997); however, complete authenticity of assessment tasks may be difficult to realize in research settings, because the structuredness of the domain plays a role here. For structured domains such as chess and bridge, the same conditions under which performance normally takes place can be reproduced in a research laboratory; for less structured or ill-structured domains, this is difficult or even impossible to do (Ericsson and Lehmann, 1996). Nonetheless, one can always strive for a high degree of authenticity. Gulikers et al. (2004) defined authentic assessment as a five-dimensional construct (i.e., task, social context, physical context, form/result, and criteria) that can vary from low to high on each of the dimensions.

The number of tasks in the collection and their duration are important factors influencing the reliability and generalizability of a study. Choosing too few tasks or tasks of too short a duration will negatively affect reliability and generalizability. On the other hand, choosing a large number of tasks or tasks of very long duration will lead to many practical problems and might exhaust both participants and researchers. In many complex domains (e.g., medical diagnosis), it is quite common and often inevitable to use a very small set of cases, because of practical circumstances and because detailed analysis of learners' responses to these complex problems is very difficult and time consuming (Ericsson, 2004; Ericsson and Smith, 1991). Unfortunately, however, there are no golden rules for determining the adequate number of tasks to use or their duration, because the important factors are highly dependent upon the domain and the specific context (Van der Vleuten and Schuwirth, 2005). It is often easier to identify a small collection of representative tasks that capture the relevant aspects of performance in highly structured domains (e.g., physics, mathematics, chess) than in ill-structured domains (e.g., political science, medicine), where a number of interacting complex skills are required.

Assessment Criteria and Standards

The term assessment criteria refers to a description of the elements or aspects of performance that will be assessed, and the term assessment standards refers to a description of the quality of performance (e.g., excellent/good/average/poor) on each of those aspects that can be expected of participants at different stages (e.g., age, grade) (Arter and Spandel, 1992). As Woolf (2004) pointed out, however, the term assessment criteria is often used in the definition of standards as well. Depending on the question one seeks to answer, different standards can be used, such as a participant's past performance (self-referenced), peer group performance (norm-referenced), or an objective standard (criterion-referenced), and there are different methods for setting standards (Cascallar and Cascallar, 2003). Much of the research on criteria and standard setting has been conducted in the context of educational practice, for national (or statewide) school tests (Hambleton et al., 2000) and for highly skilled professions, such as medicine, where the stakes of setting appropriate standards are very high (Hobma et al., 2004; Van der Vleuten and Schuwirth, 2005).
Although the formulation of good criteria and standards is extremely important in educational practice, where certification is the prime goal, it is no less important in educational research settings. Which aspects of performance are measured and which standards are set have a major impact on the generalizability and value of a study. The degree to which the domain is well structured influences not only the creation of a collection of representative tasks but also the definition of criteria, the setting of standards, and the interpretation of performance in relation to standards. In highly structured domains, such as mathematics or chess, assessing the quality of the learner's response is often fairly straightforward and unproblematic. In less structured domains, however, it is often much more difficult to identify clear standards; for example, a music student's interpretation of a piano concerto is more difficult to assess than the student's technical performance of the piece, because the former contains many more subjective elements (e.g., taste) or cultural differences than the latter.

Collecting Performance Data

No one best method for complex performance assessment exists, and it is often advisable to use multiple measures or methods in combination to obtain as complete a picture of the performance as possible. A number of methods are described here for collecting performance outcome (product) data and performance process data. Methods are classified as online (during task performance) or offline (after task performance). Which method or combination of methods is most useful depends on the particular research question, the possible constraints of the research context, and the domain. In ill-structured domains, for example, the added value of process measures may be much higher than in highly structured domains.

Collecting Performance Outcome (Product) Data

Collecting performance outcome data is quite straightforward. One takes the product of performance (e.g., an electrical circuit that was malfunctioning but is now repaired) and scores it along the defined criteria (e.g., do all the components function as they should, individually and as a whole?). Instead of assigning points for correct aspects, one can count the number of errors and analyze the types of errors made. Especially for the assessment of complex performance, however, collecting performance product data alone is not very informative. Taking into account the process leading up to the product, and the cognitive costs at which it was obtained, provides equally if not more important information.

Collecting Performance Process Data

Time on Task or Speed

An important indication of the level of mastery of a particular task is the time needed to complete it. According to the power law of practice (Newell and Rosenbloom, 1981; VanLehn, 1996), the time needed to complete a task decreases as a power function of the amount of practice (see the sketch below). Newell and Rosenbloom (1981) found that this law operates across a broad range of tasks, from solving geometry problems to keyboard typing. Several theories have been put forward to account for the power law of practice. Anderson's ACT-R explains the speed-up by assuming that slow declarative knowledge is transformed into fast procedural knowledge (Anderson, 1993; Anderson and Lebiere, 1998).
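In one commonly used parameterization (an illustrative sketch only; the studies cited above differ in the exact form and symbols they use), the power law of practice can be written as

    T(N) = A + B \cdot N^{-\alpha}

where T(N) is the time needed to complete the task after N practice trials, A is the asymptotic (minimum) task time, B is the improvable portion of the initial task time, and α > 0 is the learning rate: the larger α, the faster the speed-up with practice.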
Another explanation is that the speed-up results from repeated encounters with meaningful patterns (Ericsson and Staszewski, 1989); that is, as a result of frequent encounters with similar elements, these elements are no longer perceived as individual units but as a meaningful whole (i.e., a chunk). In addition to chunking, automation processes (Schneider and Shiffrin, 1977; Shiffrin and Schneider, 1977) occur with practice that allow for faster and more effortless performance. In summary, as expertise develops, equal performance can be attained in less time; it is therefore important to collect time-on-task data to assess improvements in complex performance.

Cognitive Load

The same processes of chunking and automation that are associated with decreases in the time required to perform a task are also responsible for decreases in the cognitive load imposed by performing the task (Paas and van Merriënboer, 1993; Yeo and Neal, 2004). Cognitive load can be measured using both online and offline techniques. The cognitive capacity that is allocated to performing the task is defined as mental effort, which is considered to reflect the actual cognitive load a task imposes (Paas and van Merriënboer, 1994a; Paas et al., 2003). A subjective but reliable technique for measuring mental effort is having individuals provide self-ratings of the amount of mental effort invested. A single-scale subjective rating instrument can be used, such as the nine-point rating scale developed by Paas (1992), or a multiple-scale instrument, such as the NASA Task Load Index (TLX), which was used, for example, by Gerjets et al. (2004, 2006). Because subjective cognitive load measures are usually recorded after each task, or after a series of tasks has been completed, they are usually considered offline measurements, although there are exceptions; for example, Ayres (2006) required participants to rate cognitive load at specific points within tasks. Objective online measures include physiological measures such as heart-rate variability (Paas and van Merriënboer, 1994b), eye-movement data, and secondary-task procedures (Brünken et al., 2003). Because they are taken during task performance, these online measures can show fluctuations in cognitive load during task performance. It is notable, however, that Paas and van Merriënboer (1994b) found the heart-rate variability measure to be quite intrusive as well as insensitive to subtle fluctuations in cognitive load. The subjective offline data are often easier to collect and analyze and provide a good indication of the overall cognitive load a task imposed (Paas et al., 2003).

Actions: Observation and Video Records

Process-tracing techniques are very well suited to assessing the different types of actions taken during task performance, some of which are purely cognitive, whereas others result in physical actions, because the "data that are recorded are of a pre-specified type (e.g., verbal reports, eye movements, actions) and are used to make inferences about the cognitive processes or knowledge underlying task performance" (Cooke, 1994, p. 814). Ways to record data that allow the inference of cognitive actions are addressed in the following sections.
The following options are available for recording the physical actions taken during task performance: (1) trained observers can write down the actions taken or check them off on an a priori constructed list (use multiple observers), (2) a (digital) video record of the participants' performance can be made, or (3) for computer-based tasks, an action record can be made using screen recording software or software that logs key presses and coordinates of mouse clicks.

Attention and Cognitive Actions: Eye-Movement Records

Eye tracking (Duchowski, 2003)—that is, recording eye-movement data while a participant is working on a (usually, but not necessarily, computer-based) task—can also be used to gather online performance process data but is much less used in educational research than the above methods. Eye-movement data give insights into the allocation of attention and provide a researcher with detailed information about what a participant is looking at, for how long, and in what order. Such data allow inferences to be made about cognitive processes (Rayner, 1998), albeit cautious inferences, as the data do not provide information on why a participant was looking at something for a certain amount of time or in a certain order. Attention can shift in response to exogenous or endogenous cues (Rayner, 1998; Stelmach et al., 1997). Exogenous shifts of attention occur mainly in response to environmental features or changes in the environment (e.g., if something brightly colored were to start flashing in the corner of a computer screen, your attention would be drawn to it). Endogenous shifts are driven by knowledge of the task, of the environment, and of the importance of available information sources (i.e., influenced by expertise level) (Underwood et al., 2003). In chess, for example, it was found that experts fixated proportionally more on relevant pieces than non-expert players (Charness et al., 2001). In electrical circuits troubleshooting, van Gog et al. (2005a) also found that participants with higher expertise fixated more on a fault-related component during problem orientation than participants with lower expertise.* Haider and Frensch (1999) used eye-movement data to corroborate their information-reduction hypothesis, which states that with practice people learn to ignore task-redundant information and limit their processing to task-relevant information. On tasks with many visual performance aspects (e.g., troubleshooting technical systems), eye-movement records may therefore provide much more information than video records. Some important problem-solving actions may be purely visual or cognitive, but these will still show up in an eye-movement record, whereas a video record will only allow inferences of visual or cognitive actions that resulted in manual or physical actions (van Gog et al., 2005b). In addition to providing information on the allocation of attention, eye-movement data can also give information about the cognitive load that particular aspects of task performance impose; for example, whereas pupil dilation (Van Gerven et al., 2004) and fixation duration (Underwood et al., 2004) are known to increase with increased processing demands, the length of saccades (i.e., rapid eye movements from one location to another; see Duchowski, 2003) is known to decrease. (For an in-depth discussion of eye-movement data and cognitive processes, see Rayner, 1998.)

* Note that the expertise differences between groups were relatively small (i.e., this was not an expert–novice study), suggesting that eye-movement data may be a useful tool in investigating relatively subtle expertise differences or expertise development.
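Returning to the third recording option above (an action record for computer-based tasks), the following is a minimal sketch of a key-press and mouse-click logger. It assumes the third-party pynput package and writes time-stamped events to a CSV file; the 60-second recording window and file name are illustrative choices, not part of any particular published procedure.

```python
import csv
import time
from pynput import keyboard, mouse  # third-party package: pip install pynput

log = open("action_record.csv", "w", newline="")
writer = csv.writer(log)
writer.writerow(["timestamp", "event", "detail"])

def on_press(key):
    # Log every key press with a high-resolution timestamp.
    writer.writerow([time.perf_counter(), "keypress", str(key)])

def on_click(x, y, button, pressed):
    # Log only the press (not the release) together with the click coordinates.
    if pressed:
        writer.writerow([time.perf_counter(), "mouseclick", f"{button} at ({x}, {y})"])

# Record for the duration of the task (here, 60 seconds for the example).
with keyboard.Listener(on_press=on_press), mouse.Listener(on_click=on_click):
    time.sleep(60)

log.close()
```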
Thought Processes and Cognitive Actions: Verbal Reports

Probably the most widely used verbal reporting techniques are concurrent and retrospective reporting (Ericsson and Simon, 1993). As their names imply, concurrent reporting is an online technique, whereas retrospective reporting is an offline technique. Concurrent reporting, or thinking aloud, requires participants to verbalize all thoughts that come to mind during task performance. Retrospective reporting requires participants to report the thoughts they had during task performance immediately after completing it. Although there has been considerable debate over the use of verbal reports as data, both methods are considered to allow valid inferences to be made about the cognitive processes underlying task performance, provided that verbalization instructions and prompts are carefully worded (Ericsson and Simon, 1993). Instructions and prompts should be worded in such a way that the evoked responses will not interfere with the cognitive processes as they occur during task performance; for example, instructions for concurrent reporting should tell participants to think aloud and verbalize everything that comes to mind but should not ask them to explain any thoughts. Prompts should be as unobtrusive as possible. Prompting participants to "keep thinking aloud" is preferable to asking them "what are you thinking?" because the latter would likely evoke self-reflection and, hence, interfere with the cognitive processes. Deviations from these instructional and prompting techniques can change either the actual cognitive processes involved or the processes that were reported, thereby compromising the validity of the reports (Boren and Ramey, 2000; Ericsson and Simon, 1993). Magliano et al. (1999), for example, found that instructions to explain, predict, associate, or understand during reading influenced the inferences from the text that participants generated while thinking aloud. Although the effect of instructions on cognitive processes is an interesting topic of study, when the intention is to elicit reports of the actual cognitive processes as they would occur without intervention, Ericsson and Simon's (1993) guidelines for wording instructions and prompts should be adhered to. Both reporting methods can result in verbal protocols that allow for valid inferences about cognitive processes; however, the potential for differences in the information they contain must be considered when choosing an appropriate method for answering a particular research question. According to Taylor and Dionne (2000), concurrent protocols mostly seem to provide information on actions and outcomes, whereas retrospective protocols seem to provide more information about "strategies that control the problem solving process" and "conditions that elicited a particular response" (p. 414). Kuusela and Paul (2000) reported that concurrent protocols contained more information than retrospective protocols, because the latter often contained only references to the effective actions that led to the solution. van Gog et al.
(2005b) investigated whether the technique of cued retrospective reporting, in which a retrospective report is cued by a replay of a record of eye movements and mouse/keyboard operations made during the task, would combine the advantages of concurrent (i.e., more action information) and retrospective (i.e., more strategic and conditional information) reporting. They found that both concurrent and cued retrospective reporting resulted in more action information, as well as in more strategic and conditional information, than retrospective reporting without a cue. Contrary to expectations, concurrent reporting resulted in more strategic and conditional information than retrospective reporting. This may (1) reflect a genuine difference from Taylor and Dionne’s results, (2) have been due to different operationalizations of the information types in the coding scheme used, or (3) have been due to the use of a different segmentation method than those used by Taylor and Dionne (2000). An explanation for the finding that concurrent reports result in more information on actions than retrospective reports may be that concurrent reporting occurs online rather than offline. Whereas concurrent reports capture information available in short-term memory during the process, retrospective reports reflect memory traces of the process retrieved from short-term memory when tasks are of very short duration or from long-term memory when tasks are of longer duration (Camps, 2003; Ericsson and Simon, 1993). It is likely that only the correct steps that have led to attainment of the goal are stored in long-term memory, because only these steps are relevant for future use. This is why having participants report retrospectively based on a record of observations or intermediate products of their problem-solving process is known to lead to better results (due to fewer omissions) than retrospective reporting without a cue (van Gog et al., 2005b; Van Someren et al., 1994). Possibly, the involvement of different memory systems might also explain Taylor and Dionne’s (2000) finding that retrospective protocols seem to contain more conditional and strategic information. This knowledge might have been used during the process but may have been omitted in concurrent reporting as a result of the greater processing demands this method places on short-term memory (Russo et al., 1989). Although this explanation is tentative, there are indications that concurrent reporting may become difficult to maintain under high cognitive load conditions (Ericsson and Simon, 1993). Indeed, participants in van Gog et al.’s study who experienced a higher cognitive load (i.e., reported investment of more mental effort) in performing the tasks indicated during a debriefing after the experiment that they disliked concurrent reporting and preferred cued retrospective reporting (van Gog, 2006). 787 Tamara van Gog, Fred Paas et al. Neuroscientific Data An emerging and promising area of educational research is the use of neuroscience methodologies to study (changes in) brain functions and structures directly, which can provide detailed data on learning processes, memory processes, and cognitive development (see, for example, Goswami, 2004; Katzir and Paré-Blagoev, 2006). 
Methods such as magnetic resonance imaging (MRI), functional magnetic resonance imaging (fMRI), electroencephalography (EEG), magnetoencephalography (MEG), positron-emission tomography (PET), and single-photon emission computed tomography (SPECT) provide (indirect) measures of neuronal activity. The reader is referred to Katzir and Paré-Blagoev (2006) for a discussion of these methods and examples of their use in educational research.

Data Analysis

Analyzing performance product, time on task, and mental effort data (at least when the subjective rating scales are used) is a very straightforward process, so it is not discussed here. In this section, analysis of observation, eye movement, and verbal protocol data is discussed, as well as the analysis of combined methods/measures.

Analysis of Observation, Eye Movement, and Verbal Protocol Data

Observation Data

Coding and analysis of observation data can take many different forms, again depending on the research question. Coding schemes are developed based on the performance aspects (criteria) one wishes to assess and sometimes may incorporate evaluation of performance aspects. Whether coding is done online (during performance, by observers) or offline (after performance, based on video, screen capture, or mouse-keyboard records), the use of multiple observers or raters is important for determining reliability of the coding. Quantitative analysis of the coded data can take the form of comparing frequencies, appropriateness (e.g., number of errors), or sequences of actions (e.g., compared to an ideal or expert sequence) and interpreting the outcome in relation to the set standard. Several commercial and noncommercial software programs have been developed to assist in the analysis of action data;* for example, Observer (Noldus et al., 2000) is commercial software for coding and analysis of digital video records; NVivo (Bazeley and Richards, 2000) is commercial software for accessing, shaping, managing, and analyzing non-numerical qualitative data; Multiple Episode Protocol Analysis (MEPA) (Erkens, 2002) is free software for annotating, coding, and analyzing both nonverbal and verbal protocols; and ACT Pro (Fu, 2001) can be used for sequential analysis of protocols of discrete user actions such as mouse clicks and key presses.

* Please note that this is not an exhaustive overview and that we have no commercial or other interest in any of the programs mentioned here.

Eye-Movement Data

For analysis of fixation data it is important to identify the gaze data points that together represent fixations. This is necessary because during fixation the eyes are not entirely motionless; small tremors and drifts may occur (Duchowski, 2003). According to Salvucci (1999), the three categories of fixation identification methods are based on velocity, dispersion, or region. Most eye-tracking software allows for defining values for the dispersion-based method, which identifies fixation points as a minimum number of data points that are grouped closely together (i.e., fall within a certain dispersion, defined in pixels) and last a minimum amount of time (duration threshold). Once fixations have been identified, defining areas of interest (AoIs) in the stimulus materials will make analysis of the huge data files more manageable by allowing summaries of fixation data to be made for each AoI, such as the number of fixations, the mean fixation duration, and the total time spent fixating.
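As an illustration of the dispersion-based identification and AoI summaries just described, the sketch below implements a simplified dispersion-threshold routine in Python. The gaze samples are assumed to be NumPy arrays of timestamps (seconds) and screen coordinates (pixels), and the dispersion and duration thresholds are illustrative values only.

```python
import numpy as np

def dispersion_fixations(t, x, y, max_dispersion=25, min_duration=0.100):
    """Simplified dispersion-based fixation identification: a fixation is a run of
    gaze samples whose horizontal plus vertical spread stays within
    `max_dispersion` pixels for at least `min_duration` seconds."""
    fixations, start = [], 0
    for end in range(len(t)):
        xs, ys = x[start:end + 1], y[start:end + 1]
        if (xs.max() - xs.min()) + (ys.max() - ys.min()) > max_dispersion:
            # The newest sample broke the dispersion limit: close the previous window.
            if t[end - 1] - t[start] >= min_duration:
                fixations.append((t[start], t[end - 1], xs[:-1].mean(), ys[:-1].mean()))
            start = end
    if t[-1] - t[start] >= min_duration:  # trailing fixation at the end of the record
        fixations.append((t[start], t[-1], x[start:].mean(), y[start:].mean()))
    return fixations  # (onset, offset, mean x, mean y) per fixation

def aoi_summary(fixations, aoi):
    """Summarize fixations falling inside a rectangular AoI (left, top, right, bottom)."""
    left, top, right, bottom = aoi
    durations = [off - on for on, off, fx, fy in fixations
                 if left <= fx <= right and top <= fy <= bottom]
    return {"n_fixations": len(durations),
            "total_dwell_time": sum(durations),
            "mean_fixation_duration": sum(durations) / len(durations) if durations else 0.0}
```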
Furthermore, a chronological listing of fixations on AoIs can be sequentially analyzed to detect patterns in viewing behavior.

Verbal Protocol Data

When verbal protocols have been transcribed, they can be segmented and coded. Segmentation based on utterances is highly reliable because it uses pauses in natural speech (Ericsson and Simon, 1993); however, many researchers apply segmentation based on meaning (Taylor and Dionne, 2000). In this case, segmentation and coding become intertwined, and the reliability of both should be evaluated. It is, again, important to use multiple raters (at least on a substantial subset of data) and determine the reliability of the coding scheme. The standard work by Ericsson and Simon (1993) provides a wealth of information on verbal protocol coding and analysis techniques. The software program MEPA (Erkens, 2002) can assist in the development of a coding scheme for verbal data, as well as in analysis of coded data with a variety of quantitative or qualitative methods.

Combining Methods and Measures

As mentioned before, there is not a preferred single method for the assessment of complex performances. By combining different methods and measures, a more complete or a more detailed picture of performance will be obtained; for example, various process-tracing techniques such as eye tracking and verbal reporting can be collected and analyzed in combination with other methods of assessment (van Gog et al., 2005a). Different product and process measures can easily be combined, and it can be argued that some of them should be combined, because a simple performance score* ignores the fact that, with expertise development, time on task and cognitive load decrease, whereas performance increases. Consider the example of a student who attains the same performance score on two comparable tasks that are spread over time, where cognitive load measures indicate that the learner had to invest a lot of mental effort to complete the task the first time and little the second. Looking only at the performance score, one might erroneously conclude that no progress was made, whereas the learner actually made a subtle step forward, because reduced cognitive load means that more capacity can be devoted to further learning. The mental efficiency measure developed by Paas and van Merriënboer (1993) reflects this relation: higher performance with less mental effort invested to attain that performance results in higher efficiency. This measure is obtained by standardizing performance and mental effort scores, and then subtracting the mean standardized mental effort score (zE) from the mean standardized performance score (zP) and dividing the outcome by the square root of 2:

Efficiency = (zP − zE) / √2

When tasks are performed under time constraints, the combination of mental effort and performance measures will suffice; however, when time on task is self-paced, it is useful to include the additional time parameter (standardized time on task, zT) in the efficiency measure, making it three-dimensional (Paas et al., 2003; Tuovinen and Paas, 2004):

Efficiency = (zP − zE − zT) / √3

* This term is somewhat ambiguous, as we have previously classified mental effort and time-on-task data as performance process data. We feel they should be regarded as such; however, in the literature performance score is often used to refer to the grade assigned to a solution or solution procedure, which is the sense in which the term is used in this subsection.
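As a worked illustration of the efficiency measures above, the following minimal sketch computes both variants for a hypothetical group of five learners; the scores and the decision to standardize across this example group are assumptions made purely for demonstration.

```python
import numpy as np

def z(x):
    # Standardize scores (z-scores) across the example group.
    return (x - x.mean()) / x.std()

# Hypothetical scores for five learners on the same task.
performance = np.array([6.0, 7.5, 5.0, 8.0, 6.5])             # e.g., rubric points
effort = np.array([7.0, 4.5, 8.0, 3.0, 5.5])                  # e.g., 9-point effort ratings
time_on_task = np.array([310.0, 240.0, 400.0, 200.0, 280.0])  # seconds

zP, zE, zT = z(performance), z(effort), z(time_on_task)

efficiency_2d = (zP - zE) / np.sqrt(2)        # performance and mental effort only
efficiency_3d = (zP - zE - zT) / np.sqrt(3)   # adds time on task for self-paced tasks

print("Two-dimensional efficiency:  ", np.round(efficiency_2d, 2))
print("Three-dimensional efficiency:", np.round(efficiency_3d, 2))
```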
Discussion

Much of the research into learning and instruction involves assessment of complex performances of cognitive tasks. The focus of this chapter section was on data collection and analysis methods that can be used for such assessments. First, the important issues related to selecting an appropriate collection of assessment tasks and defining appropriate assessment criteria and standards were discussed. Then, different ways for collecting performance product and process data, using online (during task performance) or offline (after task performance) measurements, were described. Analysis techniques were discussed and, given the lack of a single preferred method for complex performance assessment, ways to combine measures were suggested that will foster a more complete or more detailed understanding of complex performance.

This chapter section aimed to provide an overview of the important issues in assessment of complex performance on cognitive tasks and of available data collection and analysis techniques for such assessments, rather than any definite guidelines. The latter would be impossible when writing for a broad audience, because what constitutes an appropriate collection of tasks, appropriate criteria and standards, and appropriate data collection and analysis techniques is highly dependent on the research question one seeks to address and on the domain in which one wishes to do so. We hope that this overview, along with other chapter sections, provides the reader with a starting point for further development of rewarding and informative studies.

SETTING UP A LABORATORY FOR MEASUREMENT OF COMPLEX PERFORMANCES

Aaron R. Duley, Paul Ward, and Peter A. Hancock

This chapter section describes how to set up laboratories for the measurement of complex performance. Complex performance in this context does not exclusively refer to tasks that are inherently difficult to perform; rather, the term is used here in a broader sense to refer to the measurement of real-world activities that require the integration of disparate measurement instrumentation as well as the need for time-critical experimental control. We have assumed that our primary readership is comprised of graduate students and research faculty, although the chapter addresses issues relevant to all who seek a better understanding of behavioral response.

The central theme of this section relates to laboratory instrumentation. Because instrumentation is a requisite element for complex performance measurement, a common problem encountered by researchers is how to overcome the various technical hurdles that often discourage the pursuit of difficult research objectives. Thus, creating a testing environment suitable to address research questions is a major issue when planning any research program; however, searching the literature for resources relating to laboratory instrumentation configurations yields a surprisingly scant number of references and resources that address these issues. Having made just such an attempt for the purposes of this section, we can attest that articulating a general-purpose exposition on laboratory setup is indeed a challenging endeavor. The task is made more difficult by the natural ambiguity of a topic such as complex performance; nevertheless, this section aims to provide the bearings needed to resolve such questions.
In particular, we cover stimulus presentation and control alternatives, as well as hardware choices for signal routing and triggering, while offering solutions for commonly encountered problems when attempting to assemble such a laboratory. Some portions of this section are moderately technical, but every attempt has been made to ensure that the content is appropriate for our target audience.

Instrumentation and Common Configurations

Psychology has a long legacy of employing tools and instrumentation to support scientific inquiry. The online Museum of the History of Psychological Instrumentation, for example, has illustrations of over 150 devices used by early researchers to visualize organ function and systematically investigate human psychological processes and behavior (see http://www.chss.montclair.edu/psychology/museum/museum.htm). At this museum, one can view such devices as an early Wundt-style tachistoscope or the Rotationsapparatus für Komplikations-Versuche (rotary apparatus for complication studies). Titchener, a student of Wundt, continued this tradition in his own laboratory at Cornell University and described the building requirements and the costs associated with items needed for establishing the ideal psychological laboratory (Titchener, 1900, pp. 252–253):

For optics, there should be two rooms, light and dark, facing south and north respectively, and the latter divided into antechamber and inner room. For acoustics, there should be one large room, connected directly with a small, dark, and (so far as is possible without special construction) sound-proof chamber. For haptics, there should be a moderately sized room, devoted to work on cutaneous pressure, temperature, and pain, and a larger room for investigations of movement perceptions. Taste and smell should each have a small room, the latter tiled or glazed, and so situated that ventilation is easy and so does not involve the opening of doors or transom-windows into the building. There should, further, be a clock-room, for the time-registering instruments and their controls; and a large room for the investigations of the bodily processes and changes underlying affective consciousness.

Instrumentation is a central component of complex performance measurement; however, the process by which one orchestrates several devices in the broader context of addressing an experimental question is indeed challenging. Modern-day approaches reflect a paradigm shift with respect to early psychological procedures. Traditionally, a single instrument would be used for an entire experiment. Complex performance evaluation, however, often entails situations where the presentation of a stimulus is controlled by one computer, while supplementary instrumentation collects a stream of other data on a second or perhaps yet a third computer. Certainly, an ideal testing solution would allow one to minimize the time needed to set up an experiment and maximize the degree of experimental automation, thus minimizing investigator intervention, without compromising the scientific integrity of the experiment. Nevertheless, the measurement of complex performance is often in conflict with this idyllic vision. It is not sufficient for contemporary researchers simply to design experiments. They are also required to have access to the manpower and the monetary or computational resources necessary to translate a scientific question into a tenable methodological test bed.
Design Patterns for Laboratory Instrumentation

Design patterns represent structured solutions to recurring design problems (Gamma et al., 1995). The formal application of design patterns as abstract blueprints for common challenges has relevance for laboratory instrumentation configuration and equipment purchasing decisions. Although research questions vary, experiments will regularly share a comparable solution. These commonalities are important to identify, as the ability to employ a single set of tools has distinct advantages compared to solutions tailored for only one particular problem. Such advantages include cost savings, instrumentation sharing, instrumentation longevity, and laboratory scalability (e.g., the capacity to run multiple experiments simultaneously).

The purpose of the following section is to provide a level of abstraction for instrumentation configurations commonly encountered in the design of experiments related to complex performance. This approach is favored over simply providing a list of items and products that every laboratory should own. We acknowledge the considerable between-laboratory variability regarding research direction, instrumentation, and expertise; therefore, we focus primarily on instrumentation configuration and architecture as represented by design patterns common to a broad array of complex performance manipulations. Given that an experiment will often require the manipulation of stimuli in a structured way, and seeing the impracticality in comprehensively covering all research design scenarios, the following assumptions are made: (1) stimuli are physically presented to participants, (2) some stimulus properties are required to be under experimental control (e.g., presentation length), (3) measurable responses by the participant are required, and (4) control or communication of secondary instrumentation may also be necessary. These assumptions accommodate a broad spectrum of possible designs, and from them several frameworks can be outlined.

Stimulus Presentation and Control Model

Figure 55.2 depicts the simplest of the design patterns, which we term the stimulus presentation and control (SPC) model.

[Figure 55.2. Stimulus and presentation control model (presentation layer: a monitor connected via VGA or DVI; stimulus, control, and response layer: an SCRL application running on a computing target).]

The SPC model is a building block for more complex configurations. The basic framework of the SPC model includes the presentation layer and the stimulus, control, and response layer. The presentation layer represents the medium used to physically display a stimulus to a participant (e.g., monitor, projector, speaker, headphones). The stimulus, control, and response layer (SCRL) encapsulates a number of interrelated functions central to complex performance experimentation, such as the experimental protocol logic, and is the agent that coordinates and controls experimental flow and, potentially, participant response. Broadly speaking, SCRL-type roles include stimulus manipulation and timing, instrument logging and coordination, and response logging, in addition to experiment procedure management. As the SCRL often contains the logic necessary to execute the experimental paradigm, it is almost always implemented in software; thus, the SCRL application is assumed to operate on a computing target (e.g., desktop, portable digital assistant), which is represented by the dashed box in Figure 55.2.
As an example implementation of the SPC model, consider a hypothetical experiment in which participants are exposed to a number of visual stimuli for 6 sec each. Each visual stimulus occurs after a fixed foreperiod of 1 sec and a subsequent fixation cross (i.e., the point at which participants are required to direct their gaze) presented for 500 msec. Each visual stimulus is followed by a 2-sec inter-trial interval (ITI). The only requirement of the participant is to view each visual stimulus for the duration of its presentation. How do we implement this experiment? This problem has several possible solutions. A monitor (presentation layer) and Microsoft PowerPoint (SCRL) would easily accomplish the task; however, the SPC model is suitable to handle an extensive arrangement of experimental designs, so additional procedural requirements increase the need for added SCRL functionality. Consider an experiment where both a monitor and speakers are required to present the stimuli. This basic pattern still reflects an SPC model, and PowerPoint could be configured to present auditory and visual stimuli within a strict set of parameters.

TABLE 55.2. SCRL-Type Applications

Cogent 2000/Cogent Graphics (Freeware; Windows): Complete PC-based software environment for functional brain mapping experiments; contains commands useful for presenting scanner-synchronized visual stimuli (Cogent Graphics), auditory stimuli, mechanical stimuli, and taste and smell stimuli. It is also used in monitoring key presses and other physiological recordings from the subject.

DMDX (Freeware; Windows): Win 32-based display system used in psychology laboratories around the world to measure reaction times to visual and auditory stimuli.

E-Prime (Commercial; Windows): Suite of applications to design, generate, run, collect data, edit, and analyze the data; includes (1) a graphical environment that allows visual selection and specification of experimental functions, (2) a comprehensive scripting language, and (3) data management and analysis tools.

Flashdot (Freeware; Windows, Linux): Program for generating and presenting visual perceptual experiments that require a high temporal precision. It is controlled by a simple experiment building language and allows experiment generation with either a text or a graphical editor.

FLXLab (Freeware; Windows, Linux): Program for running psychology experiments; capabilities include presenting text and graphics, playing and recording sounds, and recording reaction times via the keyboard or a voice key.

PEBL (Psychology Experiment Building Language) (Freeware; Linux, Windows, Mac): New language specifically designed to be used to create psychology experiments.

PsychoPy (Freeware; Linux, Mac): Psychology stimulus software for Python; combines the graphical strengths of OpenGL with the easy Python syntax to give psychophysics a free and simple stimulus presentation and control package.

PsyScope (Freeware; Mac): Interactive graphic system for experimental design and control on the Macintosh.

PsyScript (Freeware; Linux, Mac): Application for scripting psychology experiments, similar to SuperLab, MEL, or E-Prime.

PyEPL (Python Experiment-Programming Library) (Freeware; Linux, Mac): Library for coding psychology experiments in Python; supports presentation of both visual and auditory stimuli, as well as both manual (keyboard/joystick) and sound (microphone) input as responses.

Realtime Experiment Interface (Freeware; Linux): Extensible hard real-time platform for the development of novel experiment control and signal-processing applications.

SuperLab (Commercial; Windows, Mac): Stimulus presentation software with features that support the presentation of multiple types of media as well as rapid serial visual presentation paradigms and eye tracking integration, among other features.
On the other hand, a real experiment would likely require that the foreperiod, fixation cross, and ITI appear with variable and not fixed timing. Presentation applications like PowerPoint, however, are not specifically designed for experimentation. As such, limitations are introduced as experimental designs become more elaborate. One solution to this problem is to use the Visual Basic for Applications (VBA) functionality embedded in Microsoft Office; however, requiring features such as variable timing, timing determinism (i.e., executing a task in the exact amount of time specified), support for randomization and counterbalancing, response acquisition, and logging illustrates the advantages of obtaining a flexible SCRL application equipped for research pursuits.

A number of commercial and freeware applications have been created over the past several decades to assist researchers with SCRL-type functions. The choice to select one application over another may have much to do with programming requirements, the operating system platform, protocol requirements, or all of the above. Table 55.2 provides a list of some of the SCRL applications that are available for psychological and psychophysical experiments. Additional information for these and other SCRL applications can be found in Florer (2007); the descriptions in Table 55.2 are text taken directly from the narrative about each product provided by Florer (2007).

A conventional programming language is best equipped for SCRL functionality. This alternative to the applications listed in Table 55.2 may be necessary for experiments where communication with external hardware, interfacing with external code, querying databases, or program performance requirements are a priority, although some of the SCRL applications listed in Table 55.2 provide varying degrees of these capabilities (e.g., E-Prime, SuperLab). The prospect of laboratory productivity can outweigh the flexibility and functionality afforded by a programming language; for example, from a laboratory management perspective, it is reasonable for all laboratory members to have a single platform from which they create experiments. Given the investment required to familiarize oneself with a programming language, the single-platform option can indeed be challenging to implement in practice. Formulating a laboratory in this manner does allow members to share and reuse previous testing applications or utilize knowledge about the use and idiosyncrasies of an SCRL application. Despite the learning curve, a programming language has potential benefits that cannot be realized by turnkey SCRL applications. As mentioned, high-level programming languages offer inherently greater flexibility. Although it is important to consider whether the SCRL application can be used to generate a particular test bed, one must also consider the analysis requirements following initial data collection. The flexibility of a programming language can be very helpful in this regard. One might also consider the large support base in the form of books, forums, and websites dedicated to a particular language, which can mitigate the problems that may arise during the learning process.
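To make this concrete, the sketch below implements the fixed-timing trial sequence described earlier (1-sec foreperiod, 500-msec fixation cross, 6-sec stimulus, 2-sec ITI) with PsychoPy, one of the freeware packages listed in Table 55.2. The window settings and image file names are illustrative assumptions, not a prescribed implementation, and a real study would add randomization, logging, and more careful timing.

```python
# pip install psychopy
from psychopy import visual, core

win = visual.Window(fullscr=False, color="grey", units="pix")
fixation = visual.TextStim(win, text="+", height=40)

stimulus_files = ["stim01.png", "stim02.png", "stim03.png"]  # hypothetical image files

for filename in stimulus_files:
    # 1-sec foreperiod: blank screen
    win.flip()
    core.wait(1.0)

    # 500-msec fixation cross
    fixation.draw()
    win.flip()
    core.wait(0.5)

    # 6-sec visual stimulus
    stim = visual.ImageStim(win, image=filename)
    stim.draw()
    win.flip()
    core.wait(6.0)

    # 2-sec inter-trial interval: blank screen
    win.flip()
    core.wait(2.0)

win.close()
```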
Stimulus Presentation and Control Model with External Hardware

Communicating with external hardware is essential to complex performance design. Building upon the basic SPC framework, Figure 55.3 depicts the SPC model with support for external hardware (SPCxh).

[Figure 55.3. SPC model with external hardware (presentation layer: monitor and headphones; SCRL application: LabVIEW; interface layer: parallel port and DAQ device; instrumentation layer: eye tracker and bio-instrumentation).]

Figure 55.3 illustrates a scenario where the SCRL controls both monitor and headphone output. The SCRL also interfaces with an eye tracker and a bio-instrumentation device via the parallel port and a data acquisition device (DAQ), respectively. DAQ devices are an important conduit for signal routing and acquisition, and we discuss them in greater detail later. Extensions of the SPCxh beyond the SPC model are the interface and instrumentation layers. A good argument can be made for another interface layer to exist between the presentation layer and the SCRL, but for our purposes the interface layer specifically refers to the physical connection that exists between the SCRL and the instrumentation layer. The SPCxh is thus derived from the basic SPC model, with two additional layers: one to represent the external hardware, the second to interface that hardware with the SCRL.

It is important to emphasize that the SPC and SPCxh models are only examples. We recognize that an actual implementation of any one model will most certainly differ among laboratories. The main purpose of illustrating the various arrangements in this manner is to address the major question of how complex performance design paradigms are arranged in an abstract sense. When the necessary components are identified for a specific research objective, then comes the process of determining the specific hardware and software to realize that goal. It is imperative to understand the connection between a given model (abstraction) and its real-world counterpart.

Using the example described above, consider an experiment that requires the additional collection of electrocortical activity in response to the appearance of the visual stimulus. This type of physiological data collection is termed event-related potential, as we are evaluating brain potentials time-locked to some event (i.e., the appearance of the visual stimulus in this example). Thus, we need to mark in the physiological record where this event appears for offline analysis. Figure 55.4 depicts one method to implement this requirement.

[Figure 55.4. A real-world example of the SPCxh model: LabVIEW on one computing target uses a DAQ device for digital input/output to the bio-instrumentation and a network connection to a second computing target running the bio-instrumentation software.]

On the left, the SPCxh model diagram is illustrated for the current scenario. A monitor is used to display the stimulus. A programming language called LabVIEW provides the SCRL functionality. Because the bio-instrumentation supports digital input/output (i.e., hardware that allows one to send and receive digital signals), LabVIEW utilizes the DAQ device to output digital markers to the digital input ports of the bio-instrumentation while also connecting to the instrument to collect the physiological data over the network.
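The chapter's example implements this marker logic in LabVIEW. Purely as an illustrative analogue, the sketch below performs the same digital marking from Python, assuming the nidaqmx package and a National Instruments device that the driver exposes as "Dev1", with digital line port0/line0 wired to a digital input of the bio-instrumentation; the names and timing are assumptions for the example.

```python
# pip install nidaqmx  (requires the National Instruments DAQmx drivers)
import time
import nidaqmx

with nidaqmx.Task() as task:
    # Configure one digital output line on the (assumed) device "Dev1".
    task.do_channels.add_do_chan("Dev1/port0/line0")

    # Raise the line at stimulus onset (leading edge)...
    task.write(True)
    time.sleep(6.0)   # the stimulus is visible for 6 sec in the running example
    # ...and drop it again at stimulus offset (trailing edge).
    task.write(False)
```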
We are using the term bio-instrumentation here to refer to the hardware used for the collection and assessment of physiological data. The picture on the right portrays the instantiation of the diagram. It should be observed that the diagram is meant to represent tangible software and hardware entities. Although LabVIEW is used in this example as our SCRL application, any number of alternatives could have also been employed to provide the linkage between the software used in our SCRL and the instrumentation.

A number of derivations can also be organized from the SPCxh model; for example, in many cases, the SCRL may contain only the logic needed to execute the experiment but not the application program interfaces (APIs) required to directly control a vendor's hardware. In these situations, it may be necessary to run the SCRL application alongside (e.g., on another machine) the vendor-specific hardware application. Figure 55.5 depicts this alternative, where the vendor-specific hardware executes its procedures at the same time as the SCRL application.

[Figure 55.5. SCRL application operating concurrently with device software (Option 1: a single computing target runs the SCRL application alongside the device software; Option 2: separate computing targets run the SCRL application, the bio-instrumentation software, and the eye tracker software).]

Because the layers for the SPCxh are the same as in Figure 55.3, Figure 55.5 depicts only the example instantiation of the model and not the layers for the SPCxh model. The SPCxh example in Figure 55.5 is a common configuration because hardware vendors do not always supply software interfaces that can be used by an external application. The major difference between the two options, as shown, is that the second option would require a total of three computing targets: one to execute the SCRL application and for stimulus presentation, one to execute the bio-instrumentation device software, and a third to execute the eye tracker device software. A very common question is how to synchronize the SCRL application and the device instrumentation. As with the previous example, the method of choice here is via the DAQ device for communication with the bio-instrumentation and through the parallel port for the eye tracker; however, the option for event marking and synchronization is only applicable if it is supported by the particular piece of instrumentation. Furthermore, the specific interface (e.g., digital input/output, serial/parallel) is dependent on what the manufacturer has made available for the end user. Given this information, one should ask oneself the following questions prior to investing resources in any one instrument or SCRL alternative. First, what type of limitations will be encountered when attempting to interface a particular instrument with my current resources? That is, does the manufacturer provide support for external communication with other instruments or applications? Second, does my SCRL application of choice support a method to communicate with my external hardware if this option is available? Third, does the device manufacturer provide or sell programming libraries or application program interfaces if data collection has to be curtailed in some particular way?
Fourth, what are the computational requirements to run the instrumentation software? Will the software be so processor-intensive that it requires sole execution on one dedicated machine?

Common Paradigms and Configurations

A number of common paradigms exist in psychological research, from recall and recognition paradigms to interruption-type paradigms. Although it is beyond the scope of this section to provide an example configuration for each, we have selected one example that is commonly used by many contemporary researchers: the secondary task paradigm. Both the SPC and SPCxh models are sufficient for experiments employing this paradigm; however, a common problem can occur when the logic for the primary and secondary tasks is implemented as mutually exclusive entities. An experiment that employs a simulator for the primary task environment can be viewed as a self-contained SPCxh model containing a simulation presentation environment (presentation layer), simulation control software (SCRL application), and simulation-specific hardware (interface and instrumentation layers). The question now is: how can we interface the primary task (operated by the simulator in this example) with another SCRL application that contains the logic for the secondary task?

[Figure 55.6. Secondary task paradigm (left: the primary task, a simulator presentation environment with simulator control software and hardware; right: the secondary task, an SCRL application driving a monitor and headphones and interfacing an eye tracker, a DAQ device, and bio-instrumentation, linked to the simulator via a network interface).]

Figure 55.6 contains a graphical representation of a possible configuration: an SPCxh model for the simulator communicating via a network interface that is monitored from the SCRL application on the secondary task side. On the left side of Figure 55.6 is the primary task configuration, and on the right side is the secondary task configuration. It should be noted that the simulator control software, our SCRL application, and the device-specific software do not necessarily need to be executed on separate computers; however, depending on the primary or secondary task, one may find that the processor and memory requirements necessitate multiple computers. On the secondary task side, the diagram represents a fairly complex role for the SCRL application. As shown, the SCRL application has output responsibilities to a monitor and headphones while also interfacing with an eye tracker via the serial port, interfacing the simulator via the network, and sending two lines of digital output information via the DAQ device. Numerous complex performance test beds require that a primary and secondary task paradigm be used to explicate the relationship among any number of processes. In the field of human factors, in particular, it is not uncommon for an experiment to employ simulation for the primary task and then connect a secondary task to events that may occur during the primary task. A major problem often encountered in simulation research is that the simulators are often closed systems; nevertheless, most simulators can be viewed as SPCxh arrangements with a presentation layer of some kind, an application that provides SCRL function, and the simulator control hardware itself.
If one wishes to generate secondary task events based on the occurrence of specific events in the simulation, the question then becomes one of how we might go about configuring such a solution when there is no ready-made entry point between the two systems (i.e., the primary task system and the secondary task system). The diagram on the left shows the SPCxh model for the secondary task, which is responsible for interacting with a number of additional instruments. The diagram shows a connecting line between the secondary system's network interface and the network interface of the primary task as controlled by the simulation. Because simulation manufacturers will often make technical notes available that specify how the simulator may communicate with other computers or hardware, one can often ascertain this information for integration with the secondary task's SCRL.

Summary of Design Configurations

The above examples have been elaborated in limited detail. Primarily, information pertaining to how one would configure an SCRL application for external software and hardware communication has been excluded. This specification is not practical with the sheer number of SCRL options available to researchers. As well, the diagrams do not indicate the role that the SCRL application plays when interfacing with instrumentation. It may be the case that the SCRL application plays a minimal role in starting and stopping the instrument and does not command full control of the instrument via a program interface; nevertheless, one should attempt to understand the various configurations because they do appear with great regularity in complex performance designs. Finally, it is particularly important to consider some of the issues raised when making purchasing decisions about a given instrument, interface, or application.

General-Purpose Hardware

It is evident when walking through a home improvement store that countless tools have proved their effectiveness for an almost limitless number of tasks (e.g., a hammer). The same holds for complex performance research: various general-purpose tools are exceedingly useful, and the purpose of the following sections is to discuss some of these tools and their role in complex performance evaluation.

Data Acquisition Devices

Given the ubiquity of DAQ hardware in the examples above, it is critical that one has a general idea of the functionality that a DAQ device can provide. DAQ devices are the research scientist's Swiss Army knife and are indispensable tools in the laboratory. DAQ hardware completes the bridge between the SCRL and the array of instrumentation implemented in an experiment; that is, the DAQ hardware gives a properly supported SCRL application a number of useful functions for complex performance measurement. To name a few examples, DAQ devices offer a means of transmitting important data between instruments, support mechanisms to coordinate complex actions or event sequences, provide deterministic timing via a hardware clock, and provide methods to synchronize independently operating devices. A common use for DAQ devices is to send and receive digital signals; however, to frame this application within the context of complex performance design, it is important that one be familiar with a few terms. An event refers to any information as it occurs within an experiment; for example, an event may mark the onset or offset of a stimulus, a participant's response, or the beginning or end of a trial.
A common design obstacle requires that we know an event's temporal appearance for the purpose of subsequent analyses or, alternatively, to trigger supplementary instrumentation. The term trigger is often paired with event to describe an action associated with the occurrence of an event. In some cases, event triggers may be completely internal to a single SCRL application, but in other instances event triggers may include external communications between devices or systems.

Data acquisition devices have traditionally been referred to as A/D boards (analog-to-digital boards) because of their frequent use in signal acquisition. A signal, in this context, loosely refers to any measurable physical phenomenon. Signals can be divided into two primary classes. Analog signals can vary continuously over an infinite range of values, whereas digital signals contain information in discrete states. To remember the difference between these two signals, visualize two graphs, one where someone's voice is recorded (analog) and another plotting when a light is switched on or off (digital). The traditional role of a DAQ device was to acquire and translate measured phenomena into binary units that can be represented by a digital device (e.g., computer, oscilloscope). Suppose we need to record the force exerted on a force plate. A DAQ device could be configured such that we could sample the data derived from the force plate at a rate of one sample per millisecond and subsequently record those data to a computer or logging instrument. One should note that the moniker device as a replacement for board is more appropriate given that modern DAQ alternatives are not always self-contained boards but may connect to a computing target in a few different ways. DAQ devices are available for a number of bus types. A bus, in computing vernacular, refers to a method of transmission for digital data (e.g., USB, FireWire, PCI). The DAQ device pictured in Figure 55.4, for example, is designed to connect to the USB port of a computer.

Despite their traditional role in signal acquisition (e.g., analog input), most DAQ devices contain an analog output option. Analog output reverses the process of an A/D conversion and can be used to convert digital data into analog data (D/A conversion). Analog output capabilities are useful for a variety of reasons; for example, an analog output signal can be used to produce auditory stimuli, control external hardware, or output analog data to supplementary instrumentation. The primary and secondary task example above illustrates one application for analog output. Recall that in the SPCxh model described earlier the simulator was a closed system interfaced with the SCRL application via a network interface. Suppose that it was necessary to proxy events as they occurred in the simulation to secondary hardware. A reason for this approach might be as simple as collapsing data to a single measurement file; for example, suppose we want to evaluate aiming variability in a weapons simulation in tandem with a physiological measure. One strategy would require that we merge the separate streams of data after all events have been recorded, but, by using another strategy employing the analog output option,
we could route data from the simulator and proxy it via the SCRL application controlling the DAQ device connected to our physiological recording device.

In addition to analog output, DAQ functionality for digital input/output is a critical feature for solving various complex performance measurement issues. Recall the example above where an experiment required that the SCRL application tell the physiological control software when the visual stimulus was displayed. This was accomplished by the SCRL application sending a digital output from the DAQ device to the digital input port on the bio-instrument. Figure 55.7 depicts this occurrence from the control software for the bio-instrument.

[Figure 55.7. Event triggering and recording: one EEG channel shown with a digital channel marking visual stimulus onset (leading edge) and offset (trailing edge).]

The figure shows one channel of data recorded from a participant's scalp (i.e., EEG); another, digital channel represents the onset and offset of the visual stimulus. When configuring digital signals, it is also important that one understand that the event can be defined in terms of the leading or trailing edge of the digital signal. As depicted in Figure 55.7, the leading edge (also called the rising edge) refers to the first positive deflection of the digital waveform, while the trailing edge (also called the falling edge) refers to the negative-going portion of the waveform. This distinction is important, because in many cases secondary instrumentation will provide an option to begin or end recording from the leading or trailing edge; that is, if we mistakenly begin recording on the trailing edge when the critical event occurs on the leading edge, then the secondary instrument may be triggered late or not at all. Another term native to digital events is transistor–transistor logic (TTL), which is often used to express digital triggering that operates within specific parameters. TTL refers to a standard where a given instrument's digital line will change state (e.g., on to off) if the incoming input is within a given voltage range. If the voltage supplied to the digital line is 0 then it is off, and if the digital line is supplied 5 volts then it is on.

Event triggering is an extremely important constituent of complex performance experimentation. Knowing when an event occurs may be vital for data analysis or for triggering subsequent events. The example here depicts a scenario where we are interested in knowing the onset of a visual stimulus so we can analyze our EEG signal for event-related changes. The channel with the square wave is the event channel that tells when the event occurred, with the leading edge representing the onset of the visual stimulus (5 volts) and the falling edge reflecting its offset (0 volts). A setup similar to that demonstrated in Figure 55.4 could easily be configured to produce such an example. Although this example only shows two channels, a real-world testing scenario may have several hundred to indicate when certain events occur. One strategy is to define different events as different channels. One channel may represent the visibility of a stimulus (on or off), another may represent a change in its color, and yet another may indicate any other number of events. An alternative solution is to configure the SCRL application to send data to a single channel and then create a coding scheme to reflect different events (e.g., 0 volts, stimulus hidden; 2 volts, stimulus visible; 3 volts, color changed to black; 4 volts, color changed to white). This approach reduces the number of channels required and thus makes the most of the limited number of digital channels one can control.
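As an illustration of how such a voltage-coded event channel might be decoded offline, the sketch below detects level changes (leading and trailing edges) in a synthetic trigger channel with NumPy; the sampling rate, voltage codes, and event times are hypothetical.

```python
import numpy as np

fs = 1000  # samples per second (hypothetical)
codes = {2: "stimulus visible", 3: "color changed to black", 4: "color changed to white"}

# Synthetic trigger channel recorded alongside the EEG.
trigger = np.zeros(5 * fs)
trigger[1000:3000] = 2   # stimulus visible from 1.0 s to 3.0 s
trigger[3000:4000] = 3   # color changes to black from 3.0 s to 4.0 s

# Any change in the channel value marks an event: a rise from 0 V is a leading
# (rising) edge, a return to 0 V is a trailing (falling) edge.
levels = np.round(trigger).astype(int)
change_points = np.flatnonzero(np.diff(levels)) + 1

for i in change_points:
    t = i / fs
    if levels[i] == 0:
        print(f"{t:.3f} s: trailing edge (stimulus hidden)")
    else:
        print(f"{t:.3f} s: leading edge / level change -> {codes.get(levels[i], 'unknown')}")
```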
The utility of digital event triggering cannot be overstated. Although its application for measurement of complex performance requires a small degree of technical expertise, the ability to implement event triggering affords a great degree of experimental flexibility. Under certain circumstances, however, analog triggering may also be appropriate. Consider an experiment where the threshold of a participant's voice serves as the eliciting event for a secondary stimulus. In this case, it is necessary to determine whether or not the SCRL application and DAQ interface might support this type of triggering, because such an approach offers greater flexibility for some design configurations.

Purchasing a DAQ Device

The following questions are relevant to purchasing a DAQ device for complex performance research. First, is the DAQ device going to be used to collect physiological or other data? If the answer is yes, one should understand that the price of a DAQ device is primarily a function of its resolution, speed, form factor, and the number of input/output channels supported. Although a discussion of unipolar vs. bipolar data acquisition is beyond the scope of this chapter, the reader should consult Olansen and Rosow (2002) and Stevenson and Soejima (2005) for additional information on how this may affect the final device choice. Device resolution, on the other hand, refers to the fidelity of a DAQ device to resolve analog signals; that is, what is the smallest detectable change that the device can discriminate? When choosing a board, resolution is described in terms of bits. A 16-bit board can resolve signals with greater fidelity than a 12-bit board. By taking the board resolution as an exponent of 2, one can see why: a 12-bit board has 2^12, or 4,096, possible values, while a 16-bit board has 2^16, or 65,536, possible values. The answer to this question is also a function of a few other factors (e.g., signal range, amplification). A complete understanding of these two major issues should be established prior to deciding on any one device. As well, prior to investing in a higher resolution board, which will cost more money than a lower resolution counterpart, one should evaluate what the appropriate resolution for the application is. Second, if the primary application of the DAQ device is digital event triggering, then it is important to purchase a device that is suitable for handling as many digital channels as are necessary for a particular research design. Third, does the design require analog output capabilities? Unlike digital channels, which are generally reconfigurable for input or output, analog channels are not, so it is important to know in advance the number of analog channels that a DAQ device supports. Fourth, does the testing environment require a level of timing determinism (less than 1 msec) that cannot be provided by software? For these scenarios, researchers might want to consider a DAQ device that supports hardware timing. For additional information about A/D board specifications and other factors that may affect purchasing decisions, see Staller (2005).

Computers as Instrumentation

Computers are essential to the modern laboratory. Their value is evident when one considers their versatile role in the research process; consequently, the computer's ubiquity in the sciences can account for significant costs.
Because academic institutions often hold contracts with large original equipment manufacturers, computing systems are competitively priced and warranties ensure maintenance for several years. Building a system from the ground up is also a viable option that often provides a cost-effective alternative to purchasing through an original equipment manufacturer. Although the prospect of assembling a computer may sound daunting, the process is relatively straightforward, and numerous books and websites are dedicated to the topic (see, for example, Hardwidge, 2006). Customization is one of the greatest advantages of building a machine, and because only the necessary components are purchased, overall cost is usually reduced. A major disadvantage of this approach, on the other hand, is the time associated with reviewing the components, assembling the hardware, and installing the necessary software. Most new computers will be capable of handling the majority of laboratory tasks; however, one should have a basic understanding of a computer’s major components to make an informed purchasing decision when planning complex performance test beds. This is important because it can save considerable resources that can then be allocated to other equipment, while ensuring that the computer is adequate for a given experimental paradigm. One question to consider, whether building or buying a complete computing system, is whether the number of expansion slots accommodates the input boards that may be needed to interface instrumentation. For example, if an instrument interfaces with an application via a network port and we also want to maintain the ability to network with other computers or access the Internet, it would be important to confirm that the computer’s motherboard has enough slots to accommodate the additional network card. Because DAQ devices are often sold as plug-in input boards, the same logic applies to them. Computers have evolved from general-purpose machines to machines with specific aptitudes for particular tasks; a recent development is that vendors now market particular computing systems for gaming, video editing, or home entertainment purposes. To understand the reasons behind these configurations, we strongly advocate developing a basic understanding of how certain components facilitate particular tasks. Although space prevents us from doing so within this chapter, it is important to realize that computing performance can affect timing determinism, particularly in complex performance environments.

Discussion

The challenge of understanding the various technical facets of laboratory setup and configuration represents a major hurdle when the assessment of some complex performance is a central objective. This section has discussed the common problems and design configurations that one may encounter when setting up such a laboratory. This approach, abstract in some respects, was not intended to illustrate the full range of design configurations available for complex performance evaluation; rather, the common configurations discussed here should be viewed as general-purpose architectures, independent of new technologies that may emerge. After developing an understanding of the various design configurations, one must determine the specific hardware and software required to address the research question.
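Because software timing surfaces repeatedly in these decisions, the short sketch below shows one way to get a rough sense of the timing jitter an ordinary, software-timed loop exhibits on a given machine. It uses only the Python standard library; the 1 msec target interval is an assumption chosen to match the criterion mentioned above, and the results say nothing definitive about any particular operating system or computer.

# Illustrative sketch: estimate how precisely ordinary software timing can
# schedule a nominal 1 ms interval on the current machine. Results vary with
# operating system, system load, and hardware; this is a rough diagnostic,
# not a substitute for hardware-timed DAQ input/output.
import time
import statistics

TARGET_MS = 1.0    # assumed target interval, matching the 1 msec criterion above
N_SAMPLES = 1000

errors_ms = []
for _ in range(N_SAMPLES):
    start = time.perf_counter()
    time.sleep(TARGET_MS / 1000.0)            # ask the operating system for a ~1 ms pause
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    errors_ms.append(elapsed_ms - TARGET_MS)  # deviation from the requested interval

print(f"mean overshoot: {statistics.mean(errors_ms):.3f} ms")
print(f"worst case:     {max(errors_ms):.3f} ms")

If the worst-case deviation approaches or exceeds the tolerances of the experimental paradigm, the hardware-timed triggering discussed above becomes the safer choice.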
The purpose of providing a few design configurations here was to emphasize that, in many complex performance testing environments, one must specify what equipment or software will fill the roles of presentation, stimulus control and response, instrumentation, and their interfaces.

CONCLUDING REMARKS

Setting up laboratories for the measurement of complex performances can indeed be a challenging pursuit; however, becoming knowledgeable about the solutions and tools available to aid in achieving the research objectives is rewarding on a number of levels. The ability to identify and manipulate multiple software and hardware components allows a research question to be translated quickly and effectively into a tenable methodological test bed.

REFERENCES

Adler, P. A. and Adler, P. (1994). Observational techniques. In Handbook of Qualitative Research, edited by N. K. Denzin and Y. S. Lincoln, pp. 377–392. Thousand Oaks, CA: Sage.* Airasian, P. W. (1996). Assessment in the Classroom. New York: McGraw-Hill. Alavi, M. (1994). Computer-mediated collaborative learning: an empirical evaluation. MIS Q., 18, 159–174. American Evaluation Association. (2007). Qualitative software, www.eval.org/Resources/QDA.htm. Ancona, D. G. and Caldwell, D. F. (1991). Demography and Design: Predictors of New Product Team Performance, No. 3236-91. Cambridge, MA: MIT Press. Anderson, J. R. (1993). Rules of the Mind. Hillsdale, NJ: Lawrence Erlbaum Associates. Anderson, J. R. and Lebiere, C. (1998). The Atomic Components of Thought. Mahwah, NJ: Lawrence Erlbaum Associates. Anderson, J. R., Reder, L. M., and Simon, H. A. (1996). Situated learning and education. Educ. Res., 25(4), 5–11. Arter, J. A. and Spandel, V. (1992). An NCME instructional module on: using portfolios of student work in instruction and assessment. Educ. Meas. Issues Pract., 11, 36–45. Aviv, R. (2003). Network analysis of knowledge construction in asynchronous learning networks. J. Asynch. Learn. Netw., 7(3), 1–23. Ayres, P. (2006). Using subjective measures to detect variations of intrinsic cognitive load within problems. Learn. Instruct., 16, 389–400.* Bales, R. F. (1950). Interaction Process Analysis: A Method for the Study of Small Groups. Cambridge, MA: Addison-Wesley. Battistich, V., Solomon, D., and Delucchi, K. (1993). Interaction processes and student outcomes in cooperative learning groups. Element. School J., 94(1), 19–32. Bazeley, P. and Richards, L. (2000). The NVivo Qualitative Project Book. London: SAGE. Bellman, B. L. and Jules-Rosette, B. (1977). A Paradigm for Looking: Cross-Cultural Research with Visual Media. Norwood, NJ: Ablex Publishing. Birenbaum, M. and Dochy, F. (1996). Alternatives in Assessment of Achievements, Learning Processes and Prior Knowledge. Boston, MA: Kluwer. Bjork, R. A. (1999). Assessing our own competence: heuristics and illusions. In Attention and Performance. Vol. XVII. Cognitive Regulation of Performance: Interaction of Theory and Application, edited by D. Gopher and A. Koriat, pp. 435–459. Cambridge, MA: MIT Press. Bogaart, N. C. R. and Ketelaar, H. W. E. R., Eds. (1983). Methodology in Anthropological Filmmaking: Papers of the IUAES Intercongress, Amsterdam, 1981. Gottingen, Germany: Edition Herodot. Bogdan, R. C. and Biklen, S. K. (1992). Qualitative Research for Education: An Introduction to Theory and Methods, 2nd ed. Boston, MA: Allyn & Bacon.* Boren, M. T. and Ramey, J. (2000). Thinking aloud: reconciling theory and practice. IEEE Trans. Prof. Commun., 43, 261–278. Borg, W. R. and Gall, M. D. (1989).
Educational Research: An Introduction, 5th ed. New York: Longman. Bowers, C. A. (2006). Analyzing communication sequences for team training needs assessment. Hum. Factors, 40, 672–678.* Bowers, C. A., Jentsch, F., Salas, E., and Braun, C. C. (1998). Analyzing communication sequences for team training needs assessment. Hum. Factors, 40, 672–678.* Bridgeman, B., Cline, F., and Hessinger, J. (2004). Effect of extra time on verbal and quantitative GRE scores. Appl. Meas. Educ., 17(1), 25–37. Brünken, R., Plass, J. L., and Leutner, D. (2003). Direct measurement of cognitive load in multimedia learning. Educ. Psychol., 38, 53–61. Data Collection and Analysis Byström, K. and Järvelin, K. (1995). Task complexity affects information seeking and use. Inform. Process. Manage., 31, 191–213. Campbell, D. J. (1988). Task complexity: a review and analysis. Acad. Manage. Rev., 13, 40–52.* Camps, J. (2003). Concurrent and retrospective verbal reports as tools to better understand the role of attention in second language tasks. Int. J. Appl. Linguist., 13, 201–221. Carnevale, A., Gainer, L., and Meltzer, A. (1989). Workplace Basics: The Skills Employers Want. Alexandria, VA: American Society for Training and Development. Cascallar, A. and Cascallar, E. (2003). Setting standards in the assessment of complex performances: the optimised extended-response standard setting method. In Optimising New Modes of Assessment: In Search of Qualities and Standards, edited by M. Segers, F. Dochy, and E. Cascallar, pp. 247–266. Dordrecht: Kluwer. Chandler, P. and Sweller, J. (1991). Cognitive load theory and the format of instruction. Cogn. Instruct., 8, 293–332.* Charness, N., Reingold, E. M., Pomplun, M., and Stampe, D. M. (2001). The perceptual aspect of skilled performance in chess: evidence from eye movements. Mem. Cogn., 29, 1146–1152.* Chase, C. I. (1999). Contemporary Assessment for Educators. New York: Longman. Chen, H. (2005). The Effect of Type of Threading and Level of Self-Efficacy on Achievement and Attitudes in Online Course Discussion, Ph.D. dissertation. Tempe: Arizona State University. Christensen, L. B. (2006). Experimental Methodology, 10th ed. Boston, MA: Allyn & Bacon. Collier, J. and Collier, M. (1986). Visual Anthropology: Photography as a Research Method. Albuquerque, NM: University of New Mexico Press. Cooke, N. J. (1994). Varieties of knowledge elicitation techniques. Int. J. Hum.–Comput. Stud., 41, 801-849. Cooke, N. J., Salas E., Cannon-Bowers, J. A., and Stout R. J. (2000). Measuring team knowledge. Hum. Factors, 42, 151–173.* Cornu, B. (2004). Information and communication technology transforming the teaching profession. In Instructional Design: Addressing the Challenges of Learning Through Technology and Curriculum, edited by N. Seel and S. Dijkstra, pp. 227–238. Mahwah, NJ: Lawrence Erlbaum Associates.* Crooks, S. M., Klein, J. D., Jones, E. K., and Dwyer, H. (1995). Effects of Cooperative Learning and Learner Control Modes in Computer-Based Instruction. Paper presented at the Association for Communications and Technology Annual Meeting, February 8–12, Anaheim, CA. Cuneo, C. (2000). WWW Virtual Library: Sociology Software, http://socserv.mcmaster.ca/w3virtsoclib/software.htm Demetriadis, S., Barbas, A., Psillos, D., and Pombortsis, A. (2005). Introducing ICT in the learning context of traditional school. In Preparing Teachers to Teach with Technology, edited by C. Vrasidas and G. V. Glass, pp. 99–116. Greenwich, CO: Information Age Publishers. Dewey, J. (1916/1966). 
Democracy and Education: An Introduction to the Philosophy of Education. New York: Free Press. Dijkstra, S. (2004). The integration of curriculum design, instructional design, and media choice. In Instructional Design: Addressing the Challenges of Learning Through Technology and Curriculum, edited by N. Seel and S. Dijkstra, pp. 145–170. Mahwah, NJ: Lawrence Erlbaum Associates.* Downing, S. M. and Haladyna, T. M. (1997). Test item development: validity evidence from quality assurance procedures. Appl. Meas. Educ., 10(1), 61–82. Driscoll, M. P. (1995). Paradigms for research in instructional systems. In Instructional Technology: Past, Present and Future, 2nd ed., edited by G. J. Anglin, pp. 322–329. Englewood, CO: Libraries Unlimited.* Duchowski, A. T. (2003). Eye Tracking Methodology: Theory and Practice. London: Springer. Eccles, D. W. and Tenenbaum, G. (2004). Why an expert team is more than a team of experts: a social-cognitive conceptualization of team coordination and communication in sport. J. Sport Exer. Psychol., 26, 542–560. Ericsson, K. A. (2002). Attaining excellence through deliberate practice: insights from the study of expert performance. In The Pursuit of Excellence Through Education, edited by M. Ferrari, pp. 21–55. Hillsdale, NJ: Lawrence Erlbaum Associates. Ericsson, K. A. (2004). Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad. Med., 79(10), 70–81.* Ericsson, K. A. and Lehmann, A. C. (1996). Expert and exceptional performance: evidence for maximal adaptation to task constraints. Annu. Rev. Psychol., 47, 273–305. Ericsson, K. A. and Simon, H. A. (1980). Verbal reports as data. Psychol. Rev., 87, 215–251.* Ericsson, K. A. and Simon, H. A. (1984). Protocol Analysis: Verbal Reports as Data. Cambridge, MA: MIT Press.* Ericsson, K. A. and Simon, H. A. (1993). Protocol Analysis: Verbal Reports as Data, rev. ed. Cambridge, MA: MIT Press. Ericsson, K. A. and Smith, J., Eds. (1991). Toward a General Theory of Expertise: Prospects and Limits. Cambridge, U.K.: Cambridge University Press. Ericsson, K. A. and Staszewski, J. J. (1989). Skilled memory and expertise: mechanisms of exceptional performance. In Complex Information Processing: The Impact of Herbert A. Simon, edited by D. Klahr and K. Kotovsky, pp. 235–267. Hillsdale, NJ: Lawrence Erlbaum Associates. Erkens, G. (2002). MEPA: Multiple Episode Protocol Analysis, Version 4.8, http://edugate.fss.uu.nl/mepa/index.htm. Espey, L. (2000). Technology planning and technology integration: a case study. In Proceedings of Society for Information Technology and Teacher Education International Conference 2000, edited by C. Crawford et al., pp. 95–100. Chesapeake, VA: Association for the Advancement of Computing in Education. Florer, F. (2007). Software for Psychophysics, http://vision.nyu. edu/Tips/FaithsSoftwareReview.html. Fu, W.-T. (2001). ACT-PRO action protocol analyzer: a tool for analyzing discrete action protocols. Behav. Res. Methods Instrum. Comput., 33, 149–158. Fussell, S. R., Kraut, R. E., Lerch, F. J., Shcerlis, W. L., McNally, M. M., and Cadiz, J. J. (1998). Coordination, Overload and Team Performance: Effects of Team Communication Strategies. Paper presented at the Association for Computing Machinery Conference on Computer Supported Cooperative Work, November 14–18, Seattle, WA. Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (2005). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley: Reading, MA. Garfinkel, H. (1967). 
Studies in Ethnomethodology: A Return to the Origins of Ethnomethodology. Englewood Cliffs, NJ: Prentice Hall. 801 Tamara van Gog, Fred Paas et al. Gerjets, P., Scheiter, K., and Catrambone, R. (2004). Designing instructional examples to reduce cognitive load: molar versus modular presentation of solution procedures. Instruct. Sci., 32, 33–58.* Gerjets, P., Scheiter, K., and Catrambone, R. (2006). Can learning from molar and modular worked examples be enhanced by providing instructional explanations and prompting selfexplanations? Learn. Instruct., 16, 104–121. Goetz, J. P. and LeCompte, M. D. (1984). Ethnography and Qualitative Design in Educational Research. Orlando, FL: Academic Press.* Goodyear, P. (2000). Environments for lifelong learning: ergonomics, architecture and educational design. In Integrated and Holistic Perspectives on Learning, Instruction, and Technology: Understanding Complexity, edited by J. M. Spector and T. M. Anderson, pp. 1–18. Dordrecht: Kluwer.* Goswami, U. (2004). Neuroscience and education. Br. J. Educ. Psychol., 74, 1–14. Gulikers, J. T. M., Bastiaens, T. J., and Kirschner, P. A. (2004). A five-dimensional framework for authentic assessment. Educ. Technol. Res. Dev., 52(3), 67–86.* Guzzo, R. A. and Shea, G. P. (1992). Group performance and intergroup relations in organizations. In Handbook of Industrial and Organizational Psychology Vol. 3, 2nd ed., edited by M. D. Dunnette and L. M. Hough, pp. 269–313. Palo Alto, CA: Consulting Psychologists Press. Haider, H. and Frensch, P. A. (1999). Eye movement during skill acquisition: more evidence for the information reduction hypothesis. J. Exp. Psychol. Learn. Mem. Cogn., 25, 172–190. Hambleton, R. K., Jaegar, R. M., Plake, B. S., and Mills, C. (2000). Setting performance standards on complex educational assessments. Appl. Psychol. Meas., 24, 355–366. Hara, N., Bonk, C. J., and Angeli, C. (2000). Content analysis of online discussion in an applied educational psychology course. Instruct. Sci., 28, 115–152. Hardwidge, B. (2006) Building Extreme PCs: The Complete Guide to Computer Modding. Cambridge, MA: O’Reilly Media. Heider, K. G. (1976). Ethnographic Film. Austin, TX: The University of Texas Press. Herl, H. E., O’Neil, H. F., Chung, G. K. W. K., and Schacter, J. (1999). Reliability and validity of a computer-based knowledge mapping system to measure content understanding. Comput. Hum. Behav., 15, 315–333. Higgins, N. and Rice, E. (1991). Teachers’ perspectives on competency-based testing. Educ. Technol. Res. Dev., 39(3), 59–69. Hobma, S. O., Ram, P. M., Muijtjens, A. M. M., Grol, R. P. T. M., and Van der Vleuten, C. P. M. (2004). Setting a standard for performance assessment of doctor–patient communication in general practice. Med. Educ., 38, 1244–1252. Hockings, P., Ed. (1975). Principles of Visual Anthropology. The Hague: Mouton Publishers. Horber, E. (2006). Qualitative Data Analysis Links, http://www. unige.ch/ses/sococ/qual/qual.html. Ifenthaler, D. (2005). The measurement of change: learningdependent progression of mental models. Technol. Instruct. Cogn. Learn., 2, 317–336.* Jeong, A. C. (2003). The sequential analysis of group interaction and critical thinking in online threaded discussions. Am. J. Distance Educ., 17(1), 25–43.* Johnson, D. W., Johnson, R. T., and Stanne, M. B. (2000). Cooperative Learning Methods: A Meta-Analysis, http:// www.co-operation.org/pages/cl-methods.html.* 802 Jones, E. K., Crooks, S., and Klein, J. (1995). Development of a Cooperative Learning Observational Instrument. 
Paper presented at the Association for Educational Communications and Technology Annual Meeting, February 8–12, Anaheim, CA. Jorgensen, D. L. (1989). Participant Observation: A Methodology for Human Studies. London: SAGE.* Katzir, T. and Paré-Blagoev, J. (2006). Applying cognitive neuroscience research to education: the case of literacy. Educ. Psychol., 41, 53–74. Kirschner, P., Carr, C., van Merrienboer, J., and Sloep, P. (2002). How expert designers design. Perform. Improv. Q., 15(4), 86–104. Klein, J. D. and Pridemore, D. R. (1994). Effects of orienting activities and practice on achievement, continuing motivation, and student behaviors in a cooperative learning environment. Educ. Technol. Res. Dev., 41(4), 41–54.* Klimoski, R. and Mohammed, S. (1994). Team mental model: construct or metaphor. J. Manage., 20, 403–437. Ko, S. and Rossen, S. (2001). Teaching Online: A Practical Guide. Boston, MA: Houghton Mifflin. Koschmann, T. (1996). Paradigm shifts and instructional technology. In Computer Supportive Collaborative Learning: Theory and Practice of an Emerging Paradigm, edited by T. Koschmann, pp. 1–23. Mahwah, NJ: Lawrence Erlbaum Associates.* Kuusela, H. and Paul, P. (2000). A comparison of concurrent and retrospective verbal protocol analysis. Am. J. Psychol., 113, 387–404. Langan-Fox, J. (2000). Team mental models: techniques, methods, and analytic approaches. Hum. Factors, 42, 242–271.* Langan-Fox, J. and Tan, P. (1997). Images of a culture in transition: personal constructs of organizational stability and change. J. Occup. Org. Psychol., 70, 273–293. Langan-Fox, J., Code, S., and Langfield-Smith, K. (2000). Team mental models: techniques, methods, and analytic approaches. Hum. Factors, 42, 242–271. Langan-Fox, J., Anglim, J., and Wilson, J. R. (2004). Mental models, team mental models, and performance: process, development, and future directions. Hum. Factors Ergon. Manuf., 14, 331–352. Lawless, C. J. (1994). Investigating the cognitive structure of students studying quantum theory in an Open University history of science course: a pilot study. Br. J. Educ. Technol., 25, 198–216. Lesh, R. and Dorr, H. (2003). A Models and Modeling Perspective on Mathematics Problem Solving, Learning, and Teaching. Mahwah, NJ: Lawrence Erlbaum Associates.* Levine, J. M. and Moreland, R. L. (1990). Progress in small group research. Annu. Rev. Psychol., 41, 585–634. Lincoln, Y. S. and Guba, E. G. (1985). Naturalistic Inquiry. Beverly Hills, CA: SAGE. Lingard, L. (2002). Team communications in the operating room: talk patterns, sites of tension, and implications for novices. Acad. Med., 77, 232–237. Losada, M. (1990). Collaborative Technology and Group Process Feedback: Their Impact on Interactive Sequences in Meetings. Paper presented at the Association for Computing Machinery Conference on Computer Supported Cooperative Work, October 7–10, Los Angeles, CA. Lowyck, J. and Elen, J. (2004). Linking ICT, knowledge domains, and learning support for the design of learning environments. In Instructional Design: Addressing the Challenges of Learning Through Technology and Curriculum, edited by N. Seel, and S. Dijkstra, pp. 239–256. Mahwah, NJ: Lawrence Erlbaum Associates.* Data Collection and Analysis Magliano, J. P., Trabasso, T., and Graesser, A. C. (1999). Strategic processing during comprehension. J. Educ. Psychol., 91, 615–629. Mathieu, J. E., Heffner, T. S., Goodwin, G. F., Salas, E., and Cannon-Bowers, J. A. (2000). The influence of shared mental models on team process and performance. J. Appl. 
Psychol., 85, 273–283. Mehrens, W.A., Popham, J. W., and Ryan, J. M. (1998). How to prepare students for performance assessments. Educ. Measure. Issues Pract., 17(1), 18–22. Meloy, J. M. (1994). Writing the Qualitative Dissertation: Understanding by Doing. Hillsdale, NJ: Lawrence Erlbaum Associates. Merrill, M. D. (2002). First principles of instruction. Educ. Technol. Res. Dev., 50(3), 43–55.* Michaelsen, L. K., Knight, A. B., and Fink, L. D. (2004). TeamBased Learning: A Transformative Use of Small Groups in College Teaching. Sterling, VA: Stylus Publishing. Miles, M. B. and Huberman, A. M. (1994). Qualitative Data Analysis: An Expanded Sourcebook, 2nd ed. Thousand Oaks, CA: SAGE. Miles, M. B. and Weitzman, E. A. (1994). Appendix: choosing computer programs for qualitative data analysis. In Qualitative Data Analysis: An Expanded Sourcebook, 2nd ed., edited by M. B. Miles and A. M. Huberman, pp. 311–317. Thousand Oaks, CA: SAGE. Moallem, M. (1994). An Experienced Teacher’s Model of Thinking and Teaching: An Ethnographic Study on Teacher Cognition. Paper presented at the Association for Educational Communications and Technology Annual Meeting, February 16–20, Nashville, TN. Morgan, D. L. (1996). Focus Groups as Qualitative Research Methods, 2nd ed. Thousand Oaks, CA: SAGE. Morris, L. L., Fitz-Gibbon, C. T., and Lindheim, E. (1987). How to Measure Performance and Use Tests. Newbury Park, CA: SAGE. Myllyaho, M., Salo, O., Kääriäinen, J., Hyysalo, J., and Koskela, J. (2004). A Review of Small and Large Post-Mortem Analysis Methods. Paper presented at the 17th International Conference on Software and Systems Engineering and their Applications, November 30–December 2, Paris, France. Newell, A. and Rosenbloom, P. (1981). Mechanisms of skill acquisition and the law of practice. In Cognitive Skills and Their Acquisition, edited by J. R. Anderson, pp. 1–56. Hillsdale, NJ: Lawrence Erlbaum Associates. Nitko, A. (2001). Educational Assessment of Students, 3rd ed. Upper Saddle River, NJ: Prentice Hall. Noldus, L. P. J. J., Trienes, R. J. H., Hendriksen, A. H. M., Jansen, H., and Jansen, R. G. (2000). The Observer VideoPro: new software for the collection, management, and presentation of time-structured data from videotapes and digital media files. Behav. Res. Methods Instrum. Comput., 32, 197–206. O’Connor, D. L. and Johnson, T. E. (2004). Measuring team cognition: concept mapping elicitation as a means of constructing team shared mental models in an applied setting. In Concept Maps: Theory, Methodology, Technology, Proceedings of the First International Conference on Concept Mapping Vol. 1, edited by A. J. Cañas, J. D. Novak, and F. M. Gonzalez, pp. 487–493. Pamplona, Spain: Public University of Navarra.* Olansen, J. B. and Rosow, E. (2002). Virtual Bio-Instrumentation. Upper Saddle River, NJ: Prentice Hall. Olkinuora, E., Mikkila-Erdmann, M., and Nurmi, S. (2004). Evaluating the pedagogical value of multimedia learning material: an experimental study in primary school. In Instructional Design: Addressing the Challenges of Learning Through Technology and Curriculum, edited by N. Seel and S. Dijkstra, pp. 331–352. Mahwah, NJ: Lawrence Erlbaum Associates. O’Neal, M. R. and Chissom, B. S. (1993). A Comparison of Three Methods for Assessing Attitudes. Paper presented at the Annual Meeting of the Mid-South Educational Research Association, November 10–12, New Orleans, LA. O’Neil, H. F., Wang, S., Chung, G., and Herl, H. E. (2000). Assessment of teamwork skills using computer-based teamwork simulations. 
In Aircrew Training and Assessment, edited by H. F. O’Neil and D. H. Andrews, pp. 244–276. Mahwah, NJ: Lawrence Erlbaum Associates.* Paas, F. (1992). Training strategies for attaining transfer of problem-solving skill in statistics: a cognitive load approach. J. Educ. Psychol., 84, 429–434. Paas, F. and van Merriënboer, J. J. G. (1993). The efficiency of instructional conditions: an approach to combine mental-effort and performance measures. Hum. Factors, 35, 737–743.* Paas, F. and van Merriënboer, J. J. G. (1994a). Instructional control of cognitive load in the training of complex cognitive tasks. Educ. Psychol. Rev., 6, 51–71. Paas, F. and van Merriënboer, J. J. G. (1994b). Variability of worked examples and transfer of geometrical problem-solving skills: a cognitive load approach. J. Educ. Psychol., 86, 122–133. Paas, F., Tuovinen, J. E., Tabbers, H., and Van Gerven, P. W. M. (2003). Cognitive load measurement as a means to advance cognitive load theory. Educ. Psychol., 38, 63–71. Paterson, B., Bottorff, J., and Hewatt, R. (2003). Blending observational methods: possibilities, strategies, and challenges. Int. J. Qual. Methods, 2(1), article 3. Patton, M. Q. (2001). Qualitative Research and Evaluation Methods, 3rd ed. Thousand Oaks, CA: SAGE. Paulsen, M. F. (2003). An overview of CMC and the online classroom in distance education. In Computer-Mediated Communication and the Online Classroom, edited by Z. L. Berge and M. P. Collins, pp. 31–57. Cresskill, NJ: Hampton Press. Pavitt, C. (1998). Small Group Discussion: A Theoretical Approach, 3rd ed. Newark: University of Delaware (http:// www.udel.edu/communication/COMM356/pavitt/). Pelto, P. J. and Pelto, G. H. (1978). Anthropological Research: The Structure of Inquiry, 2nd ed. Cambridge, U.K.: Cambridge University Press. Perez-Prado, A. and Thirunarayanan, M. (2002). A qualitative comparison of online and classroom-based sections of a course: exploring student perspectives. Educ. Media Int., 39(2), 195–202. Pirnay-Dummer, P. (2006). Expertise und modellbildung: Mitocar [Expertise and Model Building: Mitocar]. Ph.D. dissertation. Freiburg, Germany: Freiburg University. Popham, J. W. (1991). Appropriateness of instructor’s test-preparation practices. Educ. Meas. Issues Pract., 10(4), 12–16. Prichard, J. S. (2006). Team-skills training enhances collaborative learning. Learn. Instruct., 16, 256–265. Qureshi, S. (1995). Supporting Electronic Group Processes: A Social Perspective. Paper presented at the Association for Computing Machinery (ACM) Special Interest Group on Computer Personnel Research Annual Conference, April 6–8, Nashville, TN. 803 Tamara van Gog, Fred Paas et al. Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychol. Bull., 124, 372–422. Reigeluth, C. M. (1989). Educational technology at the crossroads: new mindsets and new directions. Educ. Technol. Res. Dev., 37 (1), 67–80.* Reilly, B. (1994). Composing with images: a study of high school video producers. In Proceedings of ED-MEDIA 94: Educational Multimedia and Hypermedia. Charlottesville, VA: Association for the Advancement of Computing in Education. Reiser, R. A. and Mory, E. H. (1991). An examination of the systematic planning techniques of two experienced teachers. Educ. Technol. Res. Dev., 39(3), 71–82. Rentsch, J. R. and Hall, R. J., Eds. (1994). Members of Great Teams Think Alike: A Model of Team Effectiveness and Schema Similarity among Team Members, Vol. 1, pp. 22–34. Stamford, CT: JAI Press. Rentsch, J. R., Small, E. 
E., and Hanges, P. J. (in press). Cognitions in organizations and teams: What is the meaning of cognitive similarity? In The People Make the Place, edited by B. S. B. Schneider. Mahwah, NJ: Lawrence Erlbaum Associates. Robinson, R. S. (1994). Investigating Channel One: a case study report. In Watching Channel One, edited by De Vaney, pp. 21–41. Albany, NY: SUNY Press. Robinson, R. S. (1995). Qualitative research: a case for case studies. In Instructional Technology: Past, Present and Future, 2nd ed., edited by G. J. Anglin, pp. 330–339. Englewood, CO: Libraries Unlimited. Ross, S. M. and Morrison, G. R. (2004). Experimental research methods. In Handbook of Research on Educational Communications and Technology, 2nd ed., edited by D. Jonassen, pp. 1021–1043. Mahwah, NJ: Lawrence Erlbaum Associates. Rourke, L., Anderson, T., Garrison, D. R., and Archer, W. (2001). Methodological issues in the content analysis of computer conference transcripts. Int. J. Artif. Intell. Educ., 12, 8–22. Rowe, A. L. and Cooke, N. J. (1995). Measuring mental models: choosing the right tools for the job. Hum. Resource Dev. Q., 6, 243–255. Russo, J. E., Johnson, E. J., and Stephens, D. L. (1989). The validity of verbal protocols. Mem. Cogn., 17, 759–769. Salas, E. and Cannon-Bowers, J. A. (2000). The anatomy of team training. In Training and Retraining: A Handbook for Business, Industry, Government, and the Military, edited by S. T. J. D. Fletcher, pp. 312–335. New York: Macmillan. Salas, E. and Cannon-Bowers, J. A. (2001). Special issue preface. J. Org. Behav., 22, 87–88. Salas, E. and Fiore, S. M. (2004). Why team cognition? An overview. In Team Cognition: Understanding the Factors That Drive Process and Performance, edited by E. Salas and S. M. Fiore. Washington, D.C.: American Psychological Association. Salomon, G. and Perkins, D. N. (1998). Individual and social aspects of learning. In Review of Research in Education, Vol. 23, edited by P. Pearson and A. Iran-Nejad, pp. 1–24. Washington, D.C.: American Educational Research Association.* Salvucci, D. D. (1999). Mapping eye movements to cognitive processes [doctoral dissertation, Carnegie Mellon University]. Dissert. Abstr. Int., 60, 5619. Sapsford, R. and Jupp, V. (1996). Data Collection and Analysis. London: SAGE. 804 Savenye, W. C. (1989). Field Test Year Evaluation of the TLTG Interactive Videodisc Science Curriculum: Effects on Student and Teacher Attitude and Classroom Implementation. Austin, TX: Texas Learning Technology Group of the Texas Association of School Boards. Savenye, W. C. (2004a). Evaluating Web-based learning systems and software. In Curriculum, Plans, and Processes in Instructional Design: International Perspectives, edited by N. Seel and Z. Dijkstra, pp. 309–330. Mahwah, NJ: Lawrence Erlbaum Associates. Savenye, W. C. (2004b). Alternatives for assessing learning in Web-based distance learning courses. Distance Learn., 1(1), 29–35.* Savenye, W. C. (2006). Improving online courses: what is interaction and why use it? Distance Learn., 2(6), 22–28. Savenye, W. C. (2007). Interaction: the power and promise of active learning. In Finding Your Online Voice: Stories Told by Experienced Online Educators, edited by M. Spector. Mahwah, NJ: Lawrence Erlbaum Associates. Savenye, W. C. and Robinson, R. S. (2004). Qualitative research issues and methods: an introduction for instructional technologists. In Handbook of Research on Educational Communications and Technology, 2nd ed., edited by D. Jonassen, pp. 1045–1071. 
Mahwah, NJ: Lawrence Erlbaum Associates. Savenye, W. C. and Robinson, R. S. (2005). Using qualitative research methods in higher education. J. Comput. Higher Educ., 16(2), 65–95. Savenye, W. C. and Strand, E. (1989). Teaching science using interactive videodisc: results of the pilot year evaluation of the Texas Learning Technology Group Project. In Eleventh Annual Proceedings of Selected Research Paper Presentations at the 1989 Annual Convention of the Association for Educational Communications and Technology in Dallas, Texas, edited by M. R. Simonson and D. Frey. Ames, IA: Iowa State University. Savenye, W. C., Leader, L. F., Schnackenberg, H. L., Jones, E. E. K., Dwyer, H., and Jiang, B. (1996). Learner navigation patterns and incentive on achievement and attitudes in hypermedia-based CAI. Proc. Assoc. Educ. Commun. Technol., 18, 655–665. Sax, G. (1980). Principles of Educational and Psychological Measurement and Evaluation, 2nd ed. Belmont, CA: Wadsworth. Schneider, W. and Shiffrin, R. M. (1977). Controlled and automatic human information processing. I. Detection, search, and attention. Psychol. Rev., 84, 1–66. Schweiger, D. M. (1986). Group approaches for improving strategic decision making: a comparative analysis of dialectical inquiry, devil’s advocacy, and consensus. Acad. Manage. J., 29(1), 51–71. Seel, N. M. (1999). Educational diagnosis of mental models: assessment problems and technology-based solutions. J. Struct. Learn. Intell. Syst., 14, 153–185. Seel, N. M. (2004). Model-centered learning environments: theory, instructional design, and effects. In Instructional Design: Addressing the Challenges of Learning Through Technology and Curriculum, edited by N. Seel and S. Dijkstra, pp. 49–73. Mahwah, NJ: Lawrence Erlbaum Associates. Seel, N. M., Al-Diban, S., and Blumschein, P. (2000). Mental models and instructional planning. In Integrated and Holistic Perspectives on Learning, Instruction, and Technology: Understanding Complexity, edited by J. M. Spector and T. M. Anderson, pp. 129–158. Dordrecht: Kluwer.* Data Collection and Analysis Segers, M., Dochy, F., and Cascallar, E., Eds. (2003). Optimising New Modes of assessment: In Search of Qualities and Standards. Dordrecht: Kluwer. Shepard, L. (2000). The role of assessment in a learning culture. Educ. Res., 29(7), 4–14. Shiffrin, R. M. and Schneider, W. (1977). Controlled and automatic human information processing. II. Perceptual learning, automatic attending, and a general theory. Psychol. Rev., 84, 127–190.* Shin, E. J., Schallert, D., and Savenye, W. C. (1994). Effects of learner control, advisement, and prior knowledge on young students’ learning in a hypertext environment. Educ. Technol. Res. Dev., 42(1), 33–46. Smith, P. L. and Wedman, J. F. (1988). Read-think-aloud protocols: a new data source for formative evaluation. Perform. Improv. Q., 1(2), 13–22. Spector, J. M. and Koszalka, T. A. (2004). The DEEP Methodology for Assessing Learning in Complex Domains. Arlington, VA: National Science Foundation.* Spradley, J. P. (1979). The Ethnographic Interview. New York: Holt, Rinehart and Winston.* Spradley, J. P. (1980). Participant Observation. New York: Holt, Rinehart and Winston.* Stahl, G. (2006). Group Cognition: Computer Support for Building Collaborative Knowledge. Cambridge, MA: MIT Press.* Staller, L. (2005). Understanding analog to digital converter specifications. [electronic version]. Embedded Syst. Design, February, 24, http://www.embedded.com/. Stelmach, L. B., Campsall, J. M., and Herdman, C. M. (1997). 
Attentional and ocular movements. J. Exp. Psychol. Hum. Percept. Perform., 23, 823–844. Stevenson, W. G. and Soejima, K. (2005). Recording techniques for electrophysiology. J. Cardiovasc. Electrophysiol., 16, 1017–1022. Strauss, A. L. and Corbin, J. M. (1994) Grounded theory methodology: an overview. In Handbook of Qualitative Research, edited by N. K. Denzin and Y. Lincoln, pp. 273–285. Thousand Oaks, CA: SAGE.* Sweller, J. (1988). Cognitive load during problem solving: effects on learning. Cogn. Sci., 12, 257–285.* Sweller, J., van Merriënboer, J. J. G., and Paas, F. (1998). Cognitive architecture and instructional design. Educ. Psychol. Rev., 10, 251–295. Sy, T. (2005). The contagious leader: Impact of the leader’s mood on the mood of group members, group affective tone, and group processes. J. Appl. Psychol., 90(2), 295–305. Taylor, K. L. and Dionne, J. P. (2000). Accessing problemsolving strategy knowledge: the complementary use of concurrent verbal protocols and retrospective debriefing. J. Educ. Psychol., 92, 413–425. Thompson, S. (2001). The authentic standards movement and its evil twin. Phi Delta Kappan, 82(5), 358–362. Thorndike, R. M. (1997). Measurement and Evaluation in Psychology and Education, 6th ed. Upper Saddle River, NJ: Prentice Hall.* Tiffin, J. and Rajasingham, L. (1995). In Search of the Virtual Class: Education in an Information Society. London: Routledge. Titchener, E. B. (1900). The equipment of a psychological laboratory. Am. J. Psychol., 11, 251–265.* Tuovinen, J. E. and Paas, F. (2004). Exploring multidimensional approaches to the efficiency of instructional conditions. Instruct. Sci., 32, 133–152. Underwood, G., Chapman, P., Brocklehurst, N., Underwood, J., and Crundall, D. (2003). Visual attention while driving: sequences of eye fixations made by experienced and novice drivers. Ergonomics, 46, 629–646. Underwood, G., Jebbett, L., and Roberts, K. (2004). Inspecting pictures for information to verify a sentence: eye movements in general encoding and in focused search. Q. J. Exp. Psychol., 57, 165–182. Urch Druskat, V. and Kayes, D. C. (2000). Learning versus performance in short-term project teams. Small Group Res., 31, 328–353. Van der Vleuten, C. P. M. and Schuwirth, L. W. T. (2005). Assessing professional competence: from methods to programmes. Med. Educ., 39, 309–317. Van Gerven, P. W. M., Paas, F., van Merriënboer, J. J. G., and Schmidt, H. (2004). Memory load and the cognitive pupillary response in aging. Psychophysiology, 41, 167–174. van Gog, T. (2006). Uncovering the Problem-Solving Process to Design Effective Worked Examples. Ph.D. dissertation. Heerlen: Open University of the Netherlands. van Gog, T., Paas, F., and van Merriënboer, J. J. G. (2005a). Uncovering expertise-related differences in troubleshooting performance: combining eye movement and concurrent verbal protocol data. Appl. Cogn. Psychol., 19, 205–221.* van Gog, T., Paas, F., van Merriënboer, J. J. G., and Witte, P. (2005b). Uncovering the problem-solving process: cued retrospective reporting versus concurrent and retrospective reporting. J. Exp. Psychol. Appl., 11, 237–244. Van Maanen, J. (1988). Tales of the Field: On Writing Ethnography. Chicago, IL: The University of Chicago Press. van Merriënboer, J. J. G. (1997). Training Complex Cognitive Skills: A Four-Component Instructional Design Model for Technical Training. Englewood Cliffs, NJ: Educational Technology Publications.* van Merriënboer, J. J. G., Jelsma, O., and Paas, F. (1992). 
Training for reflective expertise: a four-component instructional design model for complex cognitive skills. Educ. Technol. Res. Dev., 40(2), 1042–1629. Van Someren, M. W., Barnard, Y. F., and Sandberg, J. A. C. (1994). The Think Aloud Method: A Practical Guide to Modeling Cognitive Processes. London: Academic Press. VanLehn, K. (1996). Cognitive skill acquisition. Annu. Rev. Psychol., 47, 513–539.* Wainer, H. (1989). The future of item analysis. J. Educ. Meas., 26(2), 191–208. Webb, E. J., Campbell, D. T., Schwartz, R. D., and Sechrest, L. (1966). Unobtrusive Measures: Nonreactive Research in the Social Sciences. Chicago, IL: Rand McNally. Webb, N. M. (1982). Student interaction and learning in small groups. Rev. Educ. Res., 52(3), 421–445. Weitzman, E. A. and Miles, M. B. (1995). A Software Sourcebook: Computer Programs for Qualitative Data Analysis. Thousand Oaks, CA: SAGE. Willis, S. C., Bundy, C., Burdett, K., Whitehouse, C. R., and O’Neill, P. A. (2002). Small-group work and assessment in a problem-based learning curriculum: a qualitative and quantitative evaluation of student perceptions of the process of working in small groups and its assessment. Med. Teacher, 24, 495–501. Wolcott, H. F. (1990). Writing Up Qualitative Research. Newbury Park, CA: SAGE.* Woods, D. R., Felder, R. M., Rugarcia, A., and Stice, J. E. (2000). The future of engineering education. Part 3. Development of critical skills. Chem. Eng. Educ., 34, 108–117. 805 Tamara van Gog, Fred Paas et al. Woolf, H. (2004). Assessment criteria: reflections on current practices. Assess. Eval. Higher Educ., 29, 479–493.* Worchel, S., Wood, W., and Simpson, J. A., Eds. (1992). Group Process and Productivity. Newbury Park, CA: SAGE. 806 Yeo, G. B. and Neal, A. (2004). A multilevel analysis of effort, practice and performance: effects of ability, conscientiousness, and goal orientation. J. Appl. Psychol., 89, 231–247.* * Indicates a core reference.