Seeing the Whole Picture:
Evaluating Automated Assessment Systems
Debra Trusso Haley, Pete Thomas, Anne De Roeck, Marian Petre
The Centre for Research in Computing
The Open University
Walton Hall, Milton Keynes MK7 6AA UK
D.T.Haley, P.G.Thomas, M. Petre, A.DeRoeck at open.ac.uk
Abstract: This paper argues that automated assessment systems can be useful for both
students and educators, provided that their results correspond well with those of human markers. Thus,
evaluating such a system is crucial. We present an evaluation framework and show why it can be
useful for both producers and consumers of automated assessment. The framework builds on
previous work analysing Latent Semantic Analysis (LSA) based systems, a particular type of
automated assessment, which produced a research taxonomy that could help developers publish
their results in a format that is comprehensive, relatively compact, and useful to other researchers.
The paper contends that, in order to see a complete picture of an automated assessment system,
certain pieces must be emphasised. It presents the framework as a jigsaw puzzle whose pieces
join together to form the whole picture and provides an example of the utility of the framework by
presenting some empirical results from our assessment system that marks questions about HTML.
Finally, the paper suggests that the framework is not limited to LSA-based systems. With slight
modifications, it can be applied to any automated assessment system.
Keywords: automated assessment systems, computer aided assessment, CAA, Latent Semantic Analysis systems, LSA systems; teaching programming
1. Introduction
1.1. Arguments for and against using an automated assessment system
Assessment is an important component of teaching programmers. Researchers (Berglund,
1999; Daniels, Berglund, Pears, & Fincher, 2004) report that assessment can have a strong effect
on student learning. Students learn best by frequent assessment with rapid feedback.
Unfortunately, assessment can be an onerous task for educators. It takes time both to create the
assessments and to mark them. Computers can reduce the time humans spend marking
assessments. The educators can then use their time for more creative work. Educational
institutions hope to save time, and therefore money, by using computerised marking systems.
In addition to the possible time and cost savings, a computer offers some advantages over
humans. Human markers may mark inconsistently as they become fatigued, and they can be influenced
by the order in which they encounter answers. For example, if a marker first encounters a brilliant answer, the experience
could cause the marker to be harsher on the remaining answers. Even the most scrupulous
people might show bias based on personal feelings towards a student. While they may
successfully avoid awarding better marks to their favourite students, they may mark non-favoured
students more highly than they deserve in an attempt to be unbiased. Automatic markers can be
an improvement over human markers because their results are reliable and repeatable. They do
not get tired, they do not show bias based on personal feelings towards students, their results will
be the same without regard to the order in which the answers are presented, and they are able to
return results much faster than humans.
The major objection to using automated assessment is concern over its accuracy. Not only is
there no agreed-upon level of acceptable accuracy, there is no agreed-upon method by which to
measure the accuracy of automated assessment systems. Evaluation of the marking
systems is a crucial topic because they will not be used if people do not have faith in their
accuracy. We contend that an acceptable accuracy level would match the rate at which human
markers correspond with each other.
Another objection is that automatic marking takes away the human touch. We offer the
suggestion that if an educator uses automatic marking, the time saved can be devoted to more
personal contact with students. In addition, we would not entirely replace human markers with a
computer. Our university uses multiple markers for high-stakes exams. A panel of experienced
markers then moderates the marks where the humans do not agree. An automatic assessment
system could take the place of one of the human markers. By using a human and a computer to
mark the same questions, educators can benefit from double-checking the computer with the
human and vice versa.
1.2. Some existing assessment systems
Various automated assessment systems have been created to save time by automating marking.
CourseMarker is an automated assessment tool for marking programs
(http://www.cs.nott.ac.uk/~ceilidh/). Other automated assessment systems mark essays or short
answers. For example, see (Burstein, Chodorow, & Leacock, 2003) for an assessment system that
grades general knowledge essays and (Wiemer-Hastings, Graesser, & Harter, 1998) for a tutoring
system that evaluates answers in the domain of computer science.
As part of our work to improve the learning of programming and computing in general, we
research automated assessment systems. We have developed a tool (Thomas, Waugh, & Smith,
2005) that is part of an online system to mark diagrams produced by students in a database
course. We are developing EMMA (ExaM Marking Assistant), a Latent Semantic Analysis (LSA)
based automated assessment system (D. Haley, Thomas, De Roeck, & Petre, 2007), to mark short
answers about HTML and other areas of computer science. LSA is a statistical natural language
processing technique for analysing the meaning of text. We chose LSA because it has been used
successfully in the past to mark general knowledge essays (Landauer, Foltz, & Laham, 1998) and
shows promise in our area of short answers in the domain of computer science. This paper does
not offer an LSA tutorial. Readers desiring a basic introduction to LSA should consult the
references section (Landauer et al., 1998). We discuss LSA only as necessary to justify the need
for our taxonomy and evaluation framework.
Our work with EMMA has highlighted a significant challenge: the developer must choose many
options that are critical to the success of any LSA-based marking system. A review of the
literature (D. T. Haley, Thomas, De Roeck, & Petre, 2005) revealed that although many
researchers have reported work based on LSA, it is difficult to get a full picture of these systems.
Some of the missing information includes the type of training material and examples of the questions being
marked, as well as fundamental LSA options such as the weighting function and the number of dimensions in
the reduced matrix.
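To make these options concrete, the sketch below shows where each choice enters a minimal LSA pipeline: the weighting function applied to the raw term-document matrix, the number of dimensions kept after the singular value decomposition, and the comparison measure used between documents in the reduced space. It is an illustrative sketch only, not the EMMA implementation; the tiny corpus and the particular parameter values are assumptions made purely for the example.

```python
# Minimal LSA pipeline sketch (illustrative only, not the EMMA implementation).
# The "options" discussed above -- weighting function, number of dimensions,
# comparison measure -- appear as explicit choices below.
import numpy as np

def build_term_doc_matrix(docs):
    """Raw term-frequency matrix: rows = terms, columns = documents."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            X[index[w], j] += 1
    return X, index

def log_entropy_weight(X):
    """Log-entropy weighting, one common choice of weighting function."""
    p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1) / np.log(X.shape[1])
    return np.log(X + 1.0) * entropy[:, None]

def lsa_space(X, k):
    """Truncated SVD: keep k dimensions of the reduced matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(doc_vec, U_k, s_k):
    """Project a (weighted) document vector into the reduced space."""
    return doc_vec @ U_k / s_k

def cosine(a, b):
    """Comparison measure: cosine between two documents in the reduced space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Tiny illustrative corpus; a real system would use thousands of paragraphs.
corpus = [
    "a tag marks up an html element",
    "the bold tag makes text bold",
    "css styles an html element",
    "a student answer describes the html problem",
]
X, index = build_term_doc_matrix(corpus)
W = log_entropy_weight(X)                    # option: weighting function
U_k, s_k, Vt_k = lsa_space(W, k=2)           # option: number of dimensions
doc_a = fold_in(W[:, 0], U_k, s_k)
doc_b = fold_in(W[:, 2], U_k, s_k)
print(cosine(doc_a, doc_b))                  # option: comparison measure
```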
1.3. Central theme of the paper
The aim of this paper is to offer our two-part framework for automated assessment systems and
to explain why it is necessary. It is based on a research taxonomy (D. T. Haley et al., 2005) we
developed to compare Latent Semantic Analysis (LSA) based educational applications. The
framework can be of value to both producers and consumers of automated assessment systems.
Producers are researchers and developers who design and build assessment systems. They
can benefit from the framework because it provides a relatively compact yet complete description
of relevant information about their systems. If producers of automated assessment systems use
the framework, they can contribute to the improvement of the state of the art by adding to a
collection of comparable data.
Consumers are organisations, such as universities, that wish to use an automated assessment
system. These consumers are, or should be, particularly interested in two areas. The first and most
important area is the accuracy of the results. But what does accuracy mean and how do we
measure it? We believe that an automated assessment system is good enough if its marks
agree with human markers as well as human markers agree with each other. We have
discussed various ways of measuring accuracy in previous work (D. Haley et al., 2007). Second,
consumers should be interested in the amount of human effort required to use the assessment
system. Most natural language processing assessment systems, including those based on LSA,
require a large amount of training data. Although the system might save time for markers, it may
take too much time to prepare the system for deployment (for example, to train the system for a
specific data set).
It is difficult to compare automatic assessment systems because no uniform procedure exists for
reporting results. This paper attempts to fill that gap by proposing a framework for reporting on and
evaluating automatic assessment tools.
2 The framework
The first part of the framework for describing an automated assessment system can be
visualised as the jigsaw puzzle in Figure 1. Figure 2 shows the second part of the framework – the
evaluation of the system. We contend that all the pieces of this puzzle must be present for a
reviewer to see the whole picture.
The important categories of information for specifying an automated assessment system are the
items assessed, the training data, and the algorithm-specific technical details. The general type of
question (e.g., essay or multiple choice) is crucial for indicating the power of a system. The
granularity of the marking scale provides important context for judging accuracy: it is usually
easier for two markers to agree when they grade a 3-point question than one worth 100 points. The
number of items assessed provides some idea of the generalisability and validity of the results.
Both the number of unique questions and the number of examples of each question contribute to
the understanding of the value of the results. The second category comprises the technical details
of the algorithm used. Haley et al. (2005) discuss why these options are of interest to producers of
an LSA-based automated assessment system. The central piece of Figure 1 shows LSA-specific
options, but these would be changed if the automated assessment system is based on a different
method. The data used to train the system is another crucial category. Both the type and amount of
text help to indicate the amount of human effort needed to gather this essential element of
automated assessment systems. Some systems, LSA-based ones among them (D. Haley et al., 2007), need two
types of training data: general text about the topic being marked, and specific previously marked
answers for calibration. Researchers should give details about both these types of training data.
Figure 1. First part of framework: comparing automated assessment systems
Anyone interested in developing or using an automated assessment system will be interested in
its evaluation. The accuracy of the marks is of primary importance. An automated assessment
system exhibiting poor agreement with human markers is of little value. Our previous work (D. T.
Haley et al., 2005) showed that different researchers report their results using different methods.
Ideally, all researchers would use the same method for easily comparable results. If researchers
fail to reach a consensus on what information should be reported, they should at least clearly
specify how they determined the accuracy of their results. The other two pieces of the evaluation
picture are usability and effectiveness. These pieces are of interest to consumers wanting to
choose among deployed systems.
Figure 2. Second part of framework: evaluating automated assessment systems
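To illustrate how compact a framework-based description can be, the following sketch encodes the puzzle pieces of Figures 1 and 2 as a simple record that a producer could fill in and publish. The field names follow the pieces of the two figures; the example values are those reported for EMMA later in this paper, except for the LSA-specific technical values, which are illustrative placeholders rather than reported figures.

```python
# Sketch of the framework as a structured record a producer could publish.
# Field names follow the puzzle pieces of Figures 1 and 2; the LSA-specific
# values below are placeholders, not figures reported in this paper.
from dataclasses import dataclass

@dataclass
class ItemsAssessed:
    question_type: str        # e.g. essay, short answer, multiple choice
    marking_scale: str        # granularity, e.g. "4 points"
    num_questions: int
    answers_per_question: int

@dataclass
class TechnicalDetails:       # the algorithm-specific piece (here, LSA options)
    preprocessing: str
    num_dimensions: int
    weighting_function: str
    comparison_measure: str

@dataclass
class TrainingData:
    general_text: str         # background text about the topic being marked
    calibration_answers: str  # previously marked answers

@dataclass
class Evaluation:
    accuracy_method: str
    accuracy_results: str
    effectiveness: str        # does it improve learning?
    usability: str            # how easy is it to use?

@dataclass
class AssessmentSystemReport:
    system_name: str
    items: ItemsAssessed
    technical: TechnicalDetails
    training: TrainingData
    evaluation: Evaluation

emma_report = AssessmentSystemReport(
    system_name="EMMA",
    items=ItemsAssessed("short answer about HTML", "4 points", 2, 50),
    technical=TechnicalDetails("stemming, stop word removal",   # placeholder
                               300, "log entropy", "cosine"),   # placeholder values
    training=TrainingData("45,000 paragraphs of course texts",
                          "50 (question A) / 80 (question B) marked answers"),
    evaluation=Evaluation("compared marks with 5 human markers, averaged",
                          "question A: 53% identical, 34% off by one point",
                          "not relevant (research prototype)",
                          "not relevant (research prototype)"),
)
print(emma_report.system_name, emma_report.evaluation.accuracy_results)
```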
3 Research taxonomy for LSA-based automated assessment systems
This section summarises a research taxonomy developed in (D. T. Haley et al., 2005). It was
the result of an in-depth, systematic review of the literature concerning Latent Semantic Analysis
(LSA) research in the domain of educational applications. The taxonomy was designed to present
and summarise the key points from a representative sample of the literature.
The taxonomy highlighted the fact that others were having difficulty matching the results
reported by the original LSA researchers (Landauer & Dumais, 1997). We found considerable ambiguity
in critical implementation details (e.g., the weighting function used), as well as unreported
details. We speculated that the conflicting or unavailable information explains at least some of the
inability to match the success of the original researchers.
The next subsections discuss the rationale for choosing certain articles over others and the
meaning of the headings in the taxonomy.
3.1. Method for choosing articles
The purpose of the taxonomy was to summarise and highlight important details from the LSA
literature. Because the literature is extensive and our interest is in the assessment of essays and
related artefacts, the taxonomy includes only those LSA research efforts that overlap with
educational applications. The literature review found 150 articles of interest to researchers in the
field of LSA-based educational applications. In order to limit this collection to a more reasonable
sample, we constructed a citer-citee matrix of articles. That is, each cell entry (i, j) was non-blank
if article i cited article j. The articles ranged in date from perhaps the first published LSA article
(Furnas et al., 1988) to one published in May 2005 (Perez et al., 2005). We found the twenty
most-cited articles and placed them, along with the remaining 130 articles, in the categories shown
in Table 1.
Type of Article               Number in Lit Review   Number in Taxonomy
most cited                    20                     13
LSA and ed. applications      43                     15
LSA but not ed. apps.         13                     0
LSI                           11                     0
theoretical / mathematical    11                     0
reviews / summaries           11                     0
ed. apps. but not LSA         41                     0
Total                         150                    28

Table 1. Categories of articles in the literature review and those that were selected for the
taxonomy
We chose the twenty most-cited articles for the taxonomy. Some of these most-cited articles
were early works explaining the basic theory of Latent Semantic Indexing (LSI). (Researchers
trying to improve information retrieval produced the LSI theory; later, they found that LSI could be
useful for analysing text and coined the term LSA to describe LSI when used for this additional
purpose.) Although not strictly in our scope of the intersection of LSA and educational applications,
we included some of these articles because of their seminal nature. Next, we added articles from the category that
combined educational applications with LSA that were of particular interest, either because of a
novel domain or technique, or an important result. Finally, we decided to reject certain heavily cited
articles because they presented no new information pertinent to the taxonomy. This left us with 28
articles in the taxonomy.
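The selection procedure just described can be pictured as a small computation over the citer-citee matrix. The sketch below is purely illustrative: the article identifiers and citation pairs are invented, and only the top-twenty cut-off comes from the text above.

```python
# Illustrative sketch of the citer-citee selection described above: cell (i, j)
# is non-blank (here, 1) if article i cites article j; the most-cited articles
# are those with the largest column sums. The citation pairs are invented.
import numpy as np

articles = ["Furnas88", "DDF90", "LD97", "FLL99", "PGS05"]
idx = {a: k for k, a in enumerate(articles)}

# (citer, citee) pairs extracted from the articles' reference lists.
citations = [("DDF90", "Furnas88"), ("LD97", "DDF90"), ("FLL99", "LD97"),
             ("FLL99", "DDF90"), ("PGS05", "LD97"), ("PGS05", "FLL99")]

C = np.zeros((len(articles), len(articles)), dtype=int)
for citer, citee in citations:
    C[idx[citer], idx[citee]] = 1

times_cited = C.sum(axis=0)                   # column sums = citation counts
ranking = sorted(zip(articles, times_cited), key=lambda t: -t[1])
most_cited = [a for a, n in ranking[:20]]     # the paper kept the top twenty
print(most_cited)
```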
3.2. The taxonomy categories
The taxonomy organises the articles involving LSA and educational applications research into
three main categories: an Overview, Technical Details, and Evaluation. Figures 3, 4, and 5 show
the headings and sub-headings. Most of the headings are self-explanatory; some clarifications are
noted in the figures.
Figure 3. Category A: Overview. The overview headings are: Who; Where; What / Why (what the
system is supposed to do and why the researchers thought it worth doing); Stage of Development /
Type of work; Purpose; Innovation; and Major result / Key points.

Figure 4. Category B: Technical Details. The technical headings cover the options open to the
researcher (pre-processing, e.g. stemming and stop word removal; the number of dimensions of the
reduced matrix; the weighting function applied to term frequencies; and the comparison measure,
i.e. how the closeness between two documents is determined); the corpus (size, composition and
subject; the number, size and type of terms; and the number, size and type of documents, mostly
prose text, although one corpus is made from C programs (Nakov, 2000) and another has tuples
representing moves made in a complex task (Quesada, Kintsch, & Gomez, 2001)); the items
assessed, which apply only if the system assesses some kind of artefact (the type of item, e.g.
essay or short answer; the number of items assessed, i.e. the number of students times the number
of questions on the exam; and the granularity of marks, where the finer the granularity the harder it
is to match human markers); and the human effort required (any manual data manipulation, e.g.
marking up a text with notions; all LSA systems require a human to collect a corpus, and this effort
is not noted in the taxonomy).

Figure 5. Category C: Evaluation. The evaluation headings are: accuracy (the method used and the
results, given as human-to-LSA and human-to-human correlations; a successful LSA-based system
should correlate with human markers as well as they correlate with each other); effectiveness
(whether or not student learning is improved); and usability (ease of use).
Appendix A presents the taxonomy. When looking at it, the reader should keep a few points in
mind. First, the taxonomy is organised into the three categories described above: the overview for
all of the articles, their technical details, and their evaluation information. Second, each entry
presents the data relating to one study; one article can report on several studies, in which case the
article has several entries and information that would otherwise be repeated is given once. Third,
some data items are not relevant for the article being categorised. Fourth, blank entries indicate
that we were unable to locate the relevant information in the article. Fifth, the information was
summarised or taken directly from the articles; the reference given for each entry is the source for
all of its information.
Organising a huge amount of information in a small space is not easy. The taxonomy in the
appendix is based on an elegant solution in (Price, Baecker, & Small, 1993).
4 Using the Framework for an automated assessment system
Our framework for evaluating an automated assessment system is a refined version of the
taxonomy discussed in the previous section. The experience of creating and using the taxonomy
served to crystallize our thinking about the important elements of reporting on an automated
assessment system. Table 2 is an example of how the framework could be used to compare
different systems in tabular form. It starts with an overview and proceeds with the pieces in the
puzzles of Figures 1 and 2.
Overview
  System Name: EMMA
  Reference: HTD07
  Who / Where: Haley, Thomas, De Roeck, Petre; The Open University
  What / Why: assess computer science short answers for summative assessment
  Stage of Development / Type of work: research prototype
  Innovation: marked questions about HTML; determined the optimum amount of training data
  Major Result / Key points: amount of training data that works best: 50 marked answers for
    question A, 80 marked answers for question B
  Human Effort: gather training data; gather marked answers

Items Assessed
  Type of item: short answers about HTML
  Granularity of marking scale: 4 points per question
  Number of items assessed: 50 per question
  Text of the questions: "Correct the following fragments of HTML. For each case, write the correct
    HTML and write one or two sentences about the problem with the original HTML."
    Question A fragment: <I>It is <B>very</I> </B> important to read this text carefully.
      (desired appearance: It is very important to read this text carefully.)
    Question B fragment: Things to do: Pack suitcase,<BR></BR> Book taxi.
      (desired appearance: Things to do: Pack suitcase, Book taxi.)

Algorithm-specific Technical Details (LSA)
  Pre-processing: stemming, stop word removal
  Number of dimensions: 90 (question A); 500 (question B)
  Weighting function: log / entropy
  Comparison measure: cosine
  Matching threshold: none
  Terms: 12k, 1 word each
  Documents: 45k, 1 paragraph each, prose text

Training Data
  Size: 1) 45k paragraphs; 2) 50 marked answers (question A), 80 marked answers (question B)
  Composition: 1) course texts; 2) human marked answers

Evaluation
  Accuracy, method used: compared LSA marks with 5 human markers and calculated the average
  Results, question A: human to LSA: 53% identical, 34% off by 1, 12% off by 2, 1% off by 3,
    1% off by 4; human to human: 54%, 32%, 11%, 1%, 1%
  Results, question B: human to LSA: 43% identical, 45% off by 1, 6% off by 2, 3% off by 3,
    3% off by 4; human to human: 61%, 28%, 9%, 1%, 1%
  Effectiveness: not relevant (research prototype)
  Usability: not relevant (research prototype)

Table 2. Filling in the framework
Our previous work (D. T. Haley et al., 2005) highlighted the insights revealed by the taxonomy.
The major conclusion was that researchers need to know all of the details to fully evaluate and
compare reported results. The taxonomy contains many blank cells.
This implies that much
valuable information goes unreported. Research results cannot be reproduced and validated if
researchers do not provide more detailed data regarding their LSA implementations.
The framework (see Figures 1 and 2) is an attempt to simplify the taxonomy and make it more
concise. The information in Table 2 reports the results of a previous study (D. Haley et al., 2007) that
determined the optimum amount of training data needed to mark questions about HTML. All of the relevant information
concerning that study is in the table. The assessment system is called EMMA. It was developed by
Haley et al. to assess computer science short answers for summative assessment. EMMA is a
research prototype, not yet a deployed system. The innovation of the study was to determine the
optimum amount of training data; the study found that 50 marked answers were optimal for question A
and 80 marked answers were optimal for question B. Each of the questions about HTML was worth
4 points and we evaluated 50 student answers per question. The table contains the text of the two
questions. The table gives the information relating to LSA parameters. This may not be of interest
to consumers of assessment systems but is vital for other researchers wishing to replicate the
findings. We used 45,000 paragraphs from course textbooks to serve as general training data. To
evaluate the results of EMMA, we compared the marks given by five humans and calculated the
average. We then compared EMMA’s marks with each of the five humans and calculated the
average. We found that EMMA worked better for question A than for question B. For question A, 53%
of EMMA's marks were identical to the average human mark, 34% differed by one point, 12% differed
by two points, 1% differed by three points, and 1% differed by four points. This
compares to the human average agreement, which was 54%, 32%, 11%, 1%, and 1% for the same point
differences. These figures suggest that, for question A, EMMA's marks were very similar to the
humans'. The results were not as good for question B. The table gives the relevant
figures.
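The agreement figures quoted above can be computed mechanically from the raw marks. The sketch below shows one plausible way to derive the "identical / off by one / off by two" distribution for an automated marker against the average human mark, together with the corresponding human-to-human baseline. The exact averaging procedure used in the study may differ, and the marks in the example are invented purely to make it runnable.

```python
# Sketch of the accuracy measure discussed above: the distribution of absolute
# differences between the system's mark and the average human mark, compared
# with the same distribution for each human against the other humans' average.
# The marks below are invented purely to make the example runnable.
from collections import Counter

def difference_distribution(system_marks, human_marks_per_answer):
    """Percentage of answers whose system mark is identical to / off by 1, 2, ...
    from the rounded average human mark."""
    diffs = []
    for sys_mark, humans in zip(system_marks, human_marks_per_answer):
        avg_human = round(sum(humans) / len(humans))
        diffs.append(abs(sys_mark - avg_human))
    counts = Counter(diffs)
    n = len(diffs)
    return {d: 100.0 * counts.get(d, 0) / n for d in range(0, 5)}

def human_baseline(human_marks_per_answer):
    """Average the same distribution over each human marker, comparing that
    marker with the rounded average of the remaining markers."""
    num_humans = len(human_marks_per_answer[0])
    per_marker = []
    for m in range(num_humans):
        own = [humans[m] for humans in human_marks_per_answer]
        others = [[h for k, h in enumerate(humans) if k != m]
                  for humans in human_marks_per_answer]
        per_marker.append(difference_distribution(own, others))
    return {d: sum(dist[d] for dist in per_marker) / num_humans
            for d in range(0, 5)}

# Invented example: 4 answers on a 0-4 scale, one system mark and 5 human marks each.
system_marks = [3, 2, 4, 1]
human_marks = [[3, 3, 2, 3, 3], [2, 2, 1, 2, 2], [4, 3, 4, 4, 4], [0, 1, 1, 1, 2]]
print(difference_distribution(system_marks, human_marks))
print(human_baseline(human_marks))
```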
The previous paragraph repeats the information in the table. It is easier to use the table to
compare our results with other systems than it is to digest the text in the previous paragraph. The
table gives all of the information specified in the framework in a reasonably concise form.
5 Conclusions
Our framework will support sharing and comparison of results of further research into LSA-based automated assessment tools. By providing all the pieces of the puzzle, researchers
show the whole picture of their systems. The publication of all relevant details will lead to improved
understanding and the continued development and refinement of LSA.
Our work has involved an LSA-based system. However, the same benefits that accrue to LSA
researchers by using the framework can also extend to broader automated assessment system
research. The framework can be adapted by replacing the LSA-specific technical details with those
relevant to the method used.
We hope that by presenting this framework, we stimulate discussion amongst automated
assessment system producers and consumers. The ultimate goal is to improve computing
education by improving assessment.
Acknowledgements
The work reported in this study was partially supported by the European Community under the
Information Society Technologies (IST) programme of the 6th Framework Programme for RTD, project ELeGI, contract IST-002205. This document does not represent the opinion of the
European Community, and the European Community is not responsible for any use that might be
made of data appearing therein.
References
Bassu, D., & Behrens, C. (2003). Distributed LSI: Scalable concept-based information retrieval with
high semantic resolution. In Proceedings of Text Mining 2003, a workshop held in
conjunction with the Third SIAM Int'l Conference on Data Mining. San Francisco.
Berglund, A. (1999). Changing Study Habits - a Study of the Effects of Non-traditional Assessment
Methods. Work-in-Progress Report. Paper presented at the 6th Improving Student Learning
Symposium, Brighton, UK.
Berry, M. W., Dumais, S. T., & O'Brien, G. W. (1995). Using linear algebra for intelligent
information retrieval. SIAM Review, 37(4), 573-595.
Burstein, J., Chodorow, M., & Leacock, C. (2003). Criterion Online Essay Evaluation: An
Application for Automated Evaluation of Student Essays. In Proceedings of the Fifteenth
Annual Conference on Innovative Applications of Artificial Intelligence. Acapulco, Mexico.
Daniels, M., Berglund, A., Pears, A., & Fincher, S. (2004). Five Myths of Assessment. Paper
presented at the 6th Australasian Computing Education Conference (ACE2004), Dunedin,
New Zealand.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing
by Latent Semantic Analysis. Journal of the American Society for Information Science,
41(6), 391-407.
Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavioral
Research Methods, Instruments & Computers, 23(2), 229-236.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated Essay Scoring: Applications to
Educational Technology. Paper presented at the ED-MEDIA '99, Seattle.
Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., et
al. (1988). Information retrieval using a singular value decomposition model of latent
semantic structure. In Proceedings of 11th annual int'l ACM SIGIR conference on Research
and development in information retrieval (pp. 465-480): ACM.
Haley, D., Thomas, P., De Roeck, A., & Petre, M. (2007, 31 January 2007). Measuring
Improvement in Latent Semantic Analysis-Based Marking Systems: Using a Computer to
Mark Questions about HTML. Paper presented at the Proceedings of the Ninth Australasian
Computing Education Conference (ACE2007), Ballarat, Victoria, Australia.
Haley, D. T., Thomas, P., De Roeck, A., & Petre, M. (2005, 21-23 September 2005). A Research
Taxonomy for Latent Semantic Analysis-Based Educational Applications. Paper presented
at the International Conference on Recent Advances in Natural Language Processing'05,
Borovets, Bulgaria.
Kanejiya, D., Kumar, A., & Prasad, S. (2003). Automatic Evaluation of Students' Answers using
Syntactically Enhanced LSA. Paper presented at the HLT-NAACL 2003 Workshop: Building
Educational Applications Using Natural Language Processing.
Kintsch, E., Steinhart, D., Stahl, G., Matthews, C., & Lamb, R. (2000). Developing summarization
skills through the use of LSA-based feedback. Interactive Learning Environments. [Special
Issue, J. Psotka, guest editor], 8(2), 87-109.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: the Latent Semantic
Analysis theory of acquisition, induction and representation of knowledge. Psychological
Review, 104(2), 211-240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to Latent Semantic Analysis.
Discourse Processes, 25, 259-284.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How Well Can Passage
Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis
and Humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th Annual Meeting
of the Cognitive Science Society (pp. 412-417).
Lemaire, B., & Dessus, P. (2001). A system to assess the semantic content of student essays.
Journal of Educational Computing Research, 24(3), 305-320.
Nakov, P. (2000). Latent Semantic Analysis of Textual Data. In Proceedings of the Int'l Conference
on Computer Systems and Technologies. Sofia, Bulgaria.
Nakov, P., Popova, A., & Mateev, P. (2001). Weight functions impact on LSA performance. In
Proceedings of the EuroConference Recent Advances in Natural Language Processing
(RANLP'01). Tzigov Chark, Bulgaria.
Olde, B. A., Franceschetti, D. R., Karnavat, A., & Graesser, A. C. (2002). The Right Stuff: Do you
need to sanitize your corpus when using Latent Semantic Analysis? In Proceedings of the
24th Annual Meeting of the Cognitive Science Society (pp. 708-713). Fairfax.
Perez, D., Gliozzo, A., Strapparava, C., Alfonseca, E., Rodriquez, P., & Magnini, B. (2005).
Automatic Assessment of Students' free-text Answers underpinned by the combination of a
Bleu-inspired algorithm and LSA. Paper presented at the Proceedings of the 18th Int'l
FLAIRS Conference, Clearwater Beach, Florida.
Price, B. A., Baecker, R. M., & Small, I. S. (1993). A Principled Taxonomy of Software
Visualization. Journal of Visual Languages and Computing, 4(3), 211-266.
Quesada, J., Kintsch, W., & Gomez, E. (2001). A computational theory of complex problem solving
using the vector space model (part 1): Latent Semantic Analysis, through the path of
thousands of ants. Cognitive Research with Microworlds, 43-84, 117-131.
Thomas, P. G., Waugh, K., & Smith, N. (2005). Experiments in the Automatic Marking of ER-Diagrams. Paper presented at the Proceedings of the 10th Annual SIGCSE Conference on
Innovation and Technology in Computer Science Education, Monte de Caparica, Portugal.
Wiemer-Hastings, P., Graesser, A., & Harter, D. (1998). The foundations and architecture of
Autotutor. In Proceedings of the 4th International Conference on Intelligent Tutoring
Systems (pp. 334-343). San Antonio, Texas.
Wiemer-Hastings, P., Wiemer-Hastings, K., & Graesser, A. C. (1999). Improving an intelligent
tutor's comprehension of students with Latent Semantic Analysis. In S. P. Lajoie & M. Vivet
(Eds.), Artificial Intelligence in Education. Amsterdam: IOS Press.
Wiemer-Hastings, P., & Zipitria, I. (2001). Rules for Syntax, Vectors for Semantics. In Proceedings
of the 23rd Cognitive Science Conference.
Wolfe, M. B. W., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., et al. (1998).
Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse
Processes, 25, 309-336.
Appendix A
The Latent Semantic Analysis Research Taxonomy

Category A: Overview

DDF90: Deerwester, Dumais, Furnas, Landauer, Harshman (U of Chicago, Bellcore, U of W. Ontario). LSI research; explains a new theory that overcomes the deficiencies of term-matching information retrieval; indexing, not assessing essays. Innovation: LSI; explains the SVD and dimension reduction steps. Major result: for MED, at all but the two lowest levels of recall, precision of the LSI method lies well above that obtained with straightforward term matching; no difference for CISI.

Dum91: Dumais (Bellcore). LSI research; attempt better LSI results; information retrieval. Innovation: compared different weighting functions. Major result: log entropy is the best weighting function; stemming and phrases showed only 1-5% improvement; 40% better than raw frequency weighting.

BD095: Berry, Dumais, O'Brien (U of Tenn, Bellcore). LSI research; explain new theory; information retrieval. Major result: LSI is a completely automatic indexing method using SVD; shows how to do SVD updating of new terms.

FBP94: Foltz, Britt, Perfetti (New Mexico State University, Slippery Rock U, U of Pittsburgh). LSA research; matching summaries to text read; determine if LSA can work as well as coding propositions; text comprehension to evaluate a reader's situation model. Major result: the representation generated by LSA is sufficiently similar to the readers' situation model to be able to characterize the quality of their essays.

FKL98: Foltz, Kintsch, Landauer. LSA research; matching summaries to text read; analyses knowledge structures of subjects and compares them to those generated by LSA; using LSA to measure text coherence. Major result: LSA needs a corpus of at least 200 documents; online encyclopedia articles can be added.

LD97: Landauer, Dumais (U of Colorado, Bellcore). LSA research; explain new theory. Major result: LSA could be a model of human knowledge acquisition.

LLR97: Landauer, Laham, Rehder, Schreiner (U of Colorado). LSA theory; grading essays; compared essay scores given by readers and LSA to determine the importance of word order. Innovation: investigating the importance of word order; combined quality (cosine) and quantity (vector length). Major result: LSA predicted scores as well as human graders; separating technical and non-technical words made no improvement.

RSW98: Rehder, Schreiner, Wolfe, Laham, Landauer, Kintsch (U of Colorado). LSA research; grading essays; explore certain technical issues. Innovation: investigated technical vocabulary, essay length, the optimal measure of semantic relatedness, and directionality of knowledge in the high dimensional space. Major result: nothing to be gained by separating an essay into technical and non-technical terms.

WSR98: Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch, Landauer (U of Colorado, New Mexico State Univ). LSA research; compared essay scores after reading one of 4 texts; select appropriate text. Innovation: using LSA to select appropriate text. Major result: LSA can measure prior knowledge to select appropriate texts.

FLL99: Foltz, Landauer, Laham (New Mexico State University, Knowledge Analysis Technologies, U of Colorado). Intelligent Essay Assessor (IEA), http://psych.nmsu.edu/essay. Deployed application for formative assessment; reports on various studies using LSA for automated essay scoring; practice essay writing. Major result: cosine and length of the essay vector are the best predictors of the mark; over many diverse topics, the IEA scores agreed with human experts as accurately as expert scores agreed with each other.

KSS00: Kintsch, Steinhart, Stahl, LSA Research Group, Matthews, Lamb (U of Colorado, Platt Middle School). Summary Street, http://www.k-a-t.com/cu.shtml. Deployed application for formative assessment; helps students summarize essays to improve reading comprehension skills; provides feedback on length, topics covered, redundancy and relevance. Innovation: graphical interface, optimal sequencing of feedback. Major result: students produced better summaries and spent more time on task with Summary Street.

Ste01: Steinhart (U of Colorado). Summary Street, http://www.k-a-t.com/cu.shtml. Deployed application for formative assessment; helps students summarize essays to improve reading comprehension skills; provides feedback on length, topics covered, redundancy and relevance. Innovation: graphical interface, optimal sequencing of feedback. Major result: the more difficult the text, the better the result of using Summary Street; feedback doubled time on task.

Lan02b: Landauer (U of Colorado). LSA general research; explaining LSA. Major result: LSA works by solving a system of simultaneous equations.

WWG99: Wiemer-Hastings, P., Wiemer-Hastings, K., Graesser, A. (U of Memphis). AutoTutor. Deployed application for formative assessment; test the theory that LSA can facilitate more natural tutorial dialogue in an intelligent tutoring system (ITS); assess short answers given to an ITS. Innovation: tested size and composition of the corpus for best LSA results. Major result: LSA works best when specific texts comprise at least half of the corpus and the rest is subject related; works best on essays of more than 200 words.

Wie00: Wiemer-Hastings (U of Memphis). LSA research; determine the effectiveness of adding syntactic information to LSA; assess short answers given to an ITS. Innovation: added syntactic information to LSA. Major result: adding syntax decreased the effectiveness of LSA, as compared to the Wie99 study.

WG00: Wiemer-Hastings, Graesser (U of Memphis). Select-a-Kibitzer. Deployed application for formative assessment; gives meaningful feedback on essays using agents; assess short answers given to an ITS. Innovation: investigated types of corpora for best results. Major result: the best corpus is specific enough to allow subtle semantic distinctions within the domain, but general enough that moderate variations in terminology won't be lost.

WZ01: Wiemer-Hastings, Zipitria (U of Edinburgh). SLSA (Structured LSA). LSA research; evaluate student answers for use in an ITS; assess short answers given to an ITS. Innovation: combines rule-based syntactic processing with LSA; adds part of speech. Major result: adding structure-derived information improves the performance of LSA; LSA does worse on texts of fewer than 200 words.

Nak00b: Nakov (Sofia University). LSA research; explore uses of LSA in textual research. Innovation: uses a correlation matrix to display results; analysis of C programs.

NPM01: Nakov, Popova, Mateev (Sofia University). LSA research; evaluate weighting functions for text categorisation; analyse English literature texts. Innovation: compared 2 local weighting functions times 6 global weighting methods. Major result: log entropy works better than classical entropy.

FKM01: Franceschetti, Karnavat, Marineau, et al. (U of Memphis). LSA research for formative assessment; constructing different types of physics corpora to evaluate the best type for an ITS; intelligent tutoring. Innovation: used 5 different corpora to compare vector lengths of words. Major result: a carefully constructed smaller corpus may provide a more accurate representation of fundamental physical concepts than a much larger one.

OFK02: Olde, Franceschetti, Karnavat, et al. (U of Memphis, CHI Systems). LSA research for formative assessment; evaluate corpora with different specificities for use in an ITS; intelligent tutoring. Innovation: used 5 different corpora to compare essay grades. Major result: sanitizing the corpus provides no advantage.

LD01: Lemaire, Dessus (U of Grenoble-II). Apex. Deployed application for formative assessment; web-based learning system, automatic marking with feedback; provides feedback on topic, outline and coherence. Major result: LSA is a promising method to grade essays.

QKG01a: Quesada, Kintsch, Gomez (U of Colorado, U of Grenada). LSA research; investigate complex problem solving (CPS) using LSA. Innovation: represent actions taken in a Microworld as tuples for LSA. Major result: LSA is a promising tool for representing actions in Microworlds.

BB03: Bassu, Behrens (Telcordia). Distributed LSI. LSI research; improve LSI by addressing its scalability problem; information retrieval (indexing, not assessing essays). Innovation: subdivide the corpus into several homogeneous subcollections. Major result: a divide-and-conquer approach to IR not only tackles its scalability problems but actually increases the quality of returned documents.

KKP03: Kanejiya, Kumar, Prasad (Indian Institute of Technology). SELSA. LSA research; evaluate student answers in an ITS; intelligent tutoring. Innovation: augment each word with the POS tag of the preceding word; used 2 unusual measures for evaluation: MAD and correct versus false evaluation. Major result: SELSA has limited improvement over LSA.

NVA03: Nakov, Valchanova, Angelova (U of Cal, Berkeley, Bulgarian Academy of Sciences). LSA research; investigating the most effective meaning of "word"; text categorisation. Innovation: compared various methods of term weighting with NLP pre-processing. Major result: linguistic pre-processing (stemming, POS annotation, etc.) does not substantially improve LSA; proper term weighting makes more difference.

THD04: Thomas, Haley, DeRoeck, Petre (The Open University). LSA research for summative assessment; assess computer science essays. Innovation: used a very small, very specific corpus necessitating a small number of dimensions. Major result: LSA works acceptably when the granularity is coarse; need to try a larger corpus.

PGS05: Perez, Gliozzo, Strapparava, Alfonseca, Rodriquez, Magnini (U de Madrid; Istituto per la Ricerca Scientifica e Technologica). Atenea. LSA + ERB research; web-based system to assess free-text answers. Innovation: combines LSA with a BLEU-inspired algorithm, i.e. combines syntax and semantics. Major result: achieves state-of-the-art correlations to the teachers' scores while keeping language-independence and without requiring any domain specific knowledge.
Category B: Technical Details and Training Data

DDF90: remove 439 stop words (from the SMART list); 100 dimensions; cosine; corpora of titles and abstracts: MED (1,033 medical abstracts) and CISI (1,460 information science abstracts).
Dum91: 70-100 dimensions; log entropy weighting; cosine; corpora: MED, CISI, CRAN, TIME, ADI.
BD095: remove 439 stop words; 100 dimensions; cosine.
FBP94: 100 dimensions; cosine; corpus of about 27.8K words: 21 articles about the Panama Canal.
FKL98: cosine; corpus: 21 articles on the heart and a heart anatomy textbook.
LD97: 300 dimensions; ln(1+freq)/entropy weighting; cosine; corpus: 4.6M words from Grolier's Academic American Encyclopedia.
LLR97: log entropy weighting; cosine; corpus: 27 articles on heart anatomy from Grolier's Academic American Encyclopedia.
RSW98: cosine; heart anatomy corpus; human effort: separated essays into technical and non-technical terms.
WSR98: cosine; corpus: a portion of a psychology textbook; human effort: created subsections of essays.
FLL99: cosine; diverse corpora; standardised test opinion and argument essays.
KSS00: pre-processing: correct spelling; cosine; specialized texts on the heart and lung; human effort: no pre-graded summaries, but mark up the text into the topics that should appear in summaries.
Ste01: correct spelling; 300 dimensions; cosine; specialized texts (e.g. Meso-American history, sources of energy) plus a general knowledge space.
Lan02b: cosine.
WWG99: 200 dimensions; cosine; 2.3 MB corpus: 2 complete computer literacy textbooks, ten articles on each of the tutoring topics, and the entire curriculum script including expected good answers; human effort: collect good and bad answers.
Wie00: cosine; same 2.3 MB computer literacy corpus as WWG99; human effort: segmented sentences into subject-verb-object tuples; resolved anaphora and ambiguities with "and" and "or".
WG00: cosine; computer literacy corpus.
WZ01: removed 440 stop words; cosine; computer literacy corpus; human effort: segmented sentences into subject-verb-object tuples; resolved anaphora and ambiguities.
Nak00b: removed 938 stop words; 30 dimensions; dot product / cosine; corpora: religious texts (974K), C programs, and Huckleberry Finn and The Adventures of Sherlock Holmes.
NPM01: removed stop words and terms occurring only once; 15 dimensions; log and/or entropy weighting.
FKM01: 300 dimensions; cosine; physics textbook and other science textbooks; human effort: prepare specialised corpora.
OFK02: 300 dimensions; cosine; physics textbook plus material related to the curriculum script; human effort: sanitize corpora; write "expectations" for each answer.
LD01: corpus: 290K words (three French novels) plus the course text on the sociology of education; human effort: no pre-graded essays; mark up the text into topics and notions.
QKG01a: corpus of tuples representing actions in a Microworld (complex problem solving).
BB03: removed stop words; tf-idf weighting; human effort: create a classification scheme for LSI vector spaces.
KKP03: log entropy weighting; cosine; 2.3M computer literacy corpus (the AutoTutor corpus) with fewer than 2,000 human marked answers; terms are words plus part-of-speech tags; human effort: part of speech tagging.
NVA03: removed 442 stop words, stemming, POS annotation; various numbers of dimensions; Bulgarian texts; various corpora, see the paper for details.
THD04: no pre-processing; 10 dimensions; log weighting (no global weighting); cosine; 10 different corpora: student answers plus text from popular computer magazines.
PGS05: no pre-processing.
Category C: Evaluation

DDF90: evaluated using recall and precision; items of interest: queries.
Dum91: evaluated using recall and precision; queries.
BD095: evaluated using recall and precision.
FBP94: compared against human graders; essays.
FKL98: compared sentences with the cosine measure.
LD97: multiple-choice TOEFL synonym test, 80 items; LSA scored 64.4% against the students' 64.5%.
LLR97: compared against human graders; short essays of about 250 words; human-to-LSA correlation 0.77, human-to-human correlation 0.77.
RSW98: compared with one or more target texts.
WSR98: compared with 4 texts of increasing difficulty and specificity.
FLL99: holistic comparison with graded essays on a 5-point scale; standardised test essays; across several essay sets the human-to-LSA correlations (roughly 0.64 to 0.87) essentially matched the human-to-human correlations; effectiveness: average grade of 85 rose to 92 after revisions; usability: a survey showed 98% of students would definitely or probably use the system.
KSS00: compared with a teacher-provided topic list; summaries of essays marked on a 10-point scale; effectiveness: scores of those using Summary Street for difficult texts were significantly higher than those not using it; usability: used in classrooms 1997-1999; students liked the immediate feedback.
Ste01: compared with a teacher-provided topic list; summaries of essays; effectiveness and usability as for KSS00.
Lan02b: holistic marking, Pearson product-moment correlation coefficient; 5- or 10-point scales.
WWG99: compared against pre-graded answers for completeness and compatibility; short answers with an average length of 16 words.
Wie00: compared tuples in the student answer with tuples in the expected answer.
WG00: (no separate evaluation recorded).
WZ01: evaluated two texts using the cosine.
Nak00b: created correlation matrices.
NPM01: defined precision as the ratio of chunks from the same text to the number of chunks at a level.
FKM01: compared vector lengths of words for 5 different corpora.
OFK02: compared experts' marks against LSA marks using a gold standard.
LD01: compared with a teacher-provided topic list; essays marked on a 0-20 scale; effectiveness: no significant difference between three groups (control with no help, human help, and Apex help).
QKG01a: compared LSA with human assessment; moves in a Microworld.
BB03: uses 2 similarity measures.
KKP03: used 20 good answers to each of 28 questions; correlation coefficient, MAD, and correct-versus-false evaluations; 192 items assessed; human-to-LSA correlation 0.47, human-to-human correlation 0.59.
NVA03: not applicable (text categorisation study).
THD04: used Spearman's rho to compare the average human grade with the LSA grade; essays; 18 items assessed; only one set was correlated statistically.
PGS05: Pearson correlation coefficient between the humans' scores and Atenea's scores; short essays; correlation 0.5.