
Literature Review on Automated Essay Scoring (AES) Systems

Introduction
An Automated Essay Scoring (AES) system uses computational techniques to automatically
evaluate and score essays written in response to a question prompt. The main aim is to match
or improve upon human raters with respect to the reliability, objectivity, and efficiency of
scoring. These systems are broadly applied in educational and standardized testing settings,
where the volume of written content to be evaluated, including essays, is growing rapidly.
Over time, AES systems have developed from simple rule-based systems into complex
machine learning and deep learning models. This literature review analyses the methods
researchers have applied to AES and surveys key research contributions across the field,
including its methodologies, open problems, and recent advances. [8] [9] [11]

Traditional Approaches

Automated essay scoring has a long history. In the 1960s, the Project Essay Grade system
(PEG) [1], one of the first automated essay scoring systems, was developed. In the first
attempts, four raters defined criteria (proxy variables) while assessing 276 essays in English
by 8th to 12th graders. The system uses standard multiple regression analysis, with the
defined proxy variables (text features such as document length, grammar, or punctuation) as
the independent variables and the human-rated essay score as the dependent variable. With a
multiple correlation coefficient of 0.71 for both the training and test sets, and given that the
weights derived from the training set predicted the human scores on the test set well, the
computer reached results comparable to humans. However, because the average document
length alone correlated 0.51 with the score, PEG could easily be manipulated by writing
longer texts. This kind of score prediction thus considers only surface aspects and ignores the
semantics and content of the essays.
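To make the approach concrete, the following minimal Python sketch reproduces the idea of PEG-style scoring: ordinary least-squares regression on a few surface proxy variables. The specific features, toy essays, and resulting weights are illustrative assumptions, not PEG's actual ones.

    # Minimal sketch of PEG-style scoring: least-squares regression on surface
    # "proxy" features (document length, punctuation use, etc.). All feature
    # choices and data here are illustrative, not PEG's actual configuration.
    import numpy as np

    def proxy_features(essay: str) -> np.ndarray:
        words = essay.split()
        return np.array([
            len(words),                              # document length
            sum(essay.count(c) for c in ",;:"),      # punctuation use
            np.mean([len(w) for w in words]),        # average word length
        ])

    # Training essays with human-rated scores (toy data).
    essays = ["Short essay text.",
              "A somewhat longer essay, with more clauses; and more detail."]
    human_scores = np.array([2.0, 4.0])

    X = np.vstack([proxy_features(e) for e in essays])
    X = np.hstack([np.ones((len(X), 1)), X])         # intercept term
    weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

    new = proxy_features("Another unseen essay, of moderate length.")
    predicted = weights[0] + weights[1:] @ new
    print(round(float(predicted), 2))

A regression fitted this way rewards whatever the proxies measure, which is exactly why longer texts can inflate the predicted score.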
In subsequent studies, n-gram models based on the bag-of-words (BOW) approach
were commonly used for a number of decades. BOW models extract features from student
essays; in the case of the n-gram model, this is done by counting the occurrences of terms
consisting of n words. They then consider the number of shared terms between essays of the
same class and model their relationship. An often-cited AES system using BOW is the
Electronic Essay Rater (e-rater for short). The e-rater predicts scores for essay responses and
was originally applied to responses written by non-native English speakers taking the
Graduate Management Admission Test (GMAT; 13 considered questions) and the Test of
English as a Foreign Language (TOEFL; two considered questions). Two features used in the
e-rater are extracted by content vector analysis programs named EssayContent and
ArgContent. While the former applies a BOW approach to the full response, the latter
(ArgContent) applies a weighted BOW approach to each required argument in a response.
Using these features alone results in average accuracies of 0.69 and 0.82, respectively.
Including all 57 predictive features of the e-rater, the accuracy ranges from 0.87 to 0.94 (the
number of responses varies between 260 and 915 per question, with a mean of 638). For a
general overview of the types of features used in such feature-based approaches, we
additionally recommend the paper by Ke and Ng. [2]
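The following sketch illustrates a BOW n-gram feature pipeline of the kind described above, using scikit-learn; the toy essays, scores, and the choice of a ridge regressor are assumptions for illustration, not e-rater's actual components.

    # Sketch of a bag-of-words n-gram feature extractor in the spirit of early
    # feature-based systems. Data and the regressor choice are illustrative.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Ridge

    essays = [
        "The economy grows when trade expands.",
        "Trade expands and the economy grows faster.",
        "Cats are pleasant animals.",
    ]
    scores = [4, 5, 1]

    # Unigrams and bigrams (n = 1, 2): each essay becomes a vector of term counts.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(essays)

    model = Ridge().fit(X, scores)
    print(model.predict(vectorizer.transform(["The economy grows with trade."])))

Essays sharing many terms with highly scored training essays receive similar predictions, which is the core of the BOW relationship modelling described above.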
The above models are based on laborious, theory-driven feature extraction. Methods
such as latent semantic analysis (LSA) were therefore established. LSA models are corpus-
specific and trained on texts related to the given essay topic. They provide a semantic
representation of an essay, which is then compared to the semantic representations of other,
similarly scored responses. In this way, a feature vector, or embedding vector, is constructed
for an essay and then used to predict its score. Foltz, Laham, and Landauer used LSA in the
Intelligent Essay Assessor system and compared its accuracy against two human raters. Their
analysis is based on 1205 essay answers on 12 diverse topics given in the GMAT. The system
achieves correlations of 0.70-0.71 with the human raters.
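A minimal sketch of this LSA-based scoring idea, assuming toy data and only two latent dimensions (real systems use on the order of a few hundred), could look as follows:

    # Sketch of LSA-based scoring: project essays into a latent semantic space
    # and score a new essay via its similarity to human-scored essays.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

    scored_essays = [
        "Photosynthesis converts light energy into chemical energy.",
        "Plants use sunlight to make sugar from carbon dioxide and water.",
        "My favourite holiday was at the beach.",
    ]
    scores = np.array([5.0, 4.0, 1.0])

    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(scored_essays)
    lsa = TruncatedSVD(n_components=2)       # toy dimensionality
    embeddings = lsa.fit_transform(tfidf)

    new_vec = lsa.transform(vec.transform(
        ["Sunlight is turned into chemical energy by plants."]))
    sims = cosine_similarity(new_vec, embeddings)[0]
    # Predict as a similarity-weighted average of the neighbours' scores.
    print(round(float(sims @ scores / sims.sum()), 2))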

Approaches Based on Neural Networks

AES systems advanced significantly with the introduction of more sophisticated natural
language processing (NLP) techniques. Burstein et al. developed the e-rater system [2],
which used linguistic features and statistical models to evaluate essays. Similarly, the
Intelligent Essay Assessor (IEA) [3] employed Latent Semantic Analysis (LSA) to
compare student essays with reference texts, capturing more semantic and stylistic nuances
than earlier models.
With the rise of machine learning and deep learning, AES systems became more
effective. Taghipour & Ng (2016) [4] applied Recurrent Neural Networks (RNNs) to AES,
significantly improving performance over feature-based models. Dong et al. (2017) [5]
further extended this by using a multi-aspect RNN to jointly assess grammar, content, and
style. The most recent advancements have come from Transformer-based architectures [6],
with researchers like Zhao, Hou, and Du (2020) using self-attention mechanisms to capture
complex relationships between words in longer, more complex essays.

Methodological Background on Recurrent Neural Nets (RNN)


RNNs are a relatively old technique among neural network architectures. The two most
prominent types of RNN are the gated recurrent unit (GRU) and the long short-term memory
(LSTM) network. These models are largely responsible for breakthrough results in speech
recognition and also dominated the field of NLP until the appearance of transformer-based
models. While the GRU is very efficient for smaller datasets and shorter text sequences, the
LSTM also allows longer text sequences to be modelled by integrating a memory cell that can
pass information relatively easily from one point to another within a sequence. In particular,
the LSTM thereby allowed the training of language models with contextual embeddings,
where the position of a token in a sequence makes a difference to its meaning. Due to their
sequential structure, however, RNNs in general have the disadvantage that training them is
very slow. Furthermore, both the GRU and the LSTM suffer from the "vanishing gradient
problem", an optimization problem that traditionally occurs in large neural nets and that
makes training LSTMs on very long sequences, for example, impractical.
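The following minimal PyTorch snippet, not taken from any of the reviewed papers, simply instantiates both variants to show the interface difference: the LSTM additionally maintains the memory cell discussed above.

    # Minimal illustration of the two RNN variants: a GRU and an LSTM
    # processing the same sequence of embedded tokens (toy dimensions).
    import torch
    import torch.nn as nn

    x = torch.randn(1, 30, 50)      # one sequence of 30 tokens, embedding dim 50

    gru = nn.GRU(input_size=50, hidden_size=64, batch_first=True)
    lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)

    gru_out, h = gru(x)             # GRU returns hidden states only
    lstm_out, (h, c) = lstm(x)      # LSTM also returns the memory cell c
    print(gru_out.shape, lstm_out.shape)   # both: torch.Size([1, 30, 64])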
Taghipour and Ng applied RNNs based on self-trained word embeddings; their
combination of a CNN and a bidirectional LSTM significantly outperforms two baseline
models based on support vector regression and Bayesian linear ridge regression, and also
outperforms models based on the LSTM, the CNN, or a GRU alone.
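A minimal sketch of this kind of CNN plus bidirectional LSTM regressor is given below; the hyperparameters, pooling choice, and output scaling are illustrative assumptions, not Taghipour and Ng's exact configuration.

    # Sketch of a CNN + bidirectional LSTM essay scorer in PyTorch.
    # Hyperparameters are illustrative, not the published configuration.
    import torch
    import torch.nn as nn

    class CnnBiLstmScorer(nn.Module):
        def __init__(self, vocab_size, emb_dim=50, conv_filters=64, lstm_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # 1-D convolution over the token sequence extracts local n-gram features.
            self.conv = nn.Conv1d(emb_dim, conv_filters, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(conv_filters, lstm_dim, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(2 * lstm_dim, 1)    # single normalized score

        def forward(self, token_ids):                # (batch, seq_len)
            x = self.embed(token_ids)                # (batch, seq_len, emb_dim)
            x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
            states, _ = self.lstm(x)                 # (batch, seq_len, 2*lstm_dim)
            pooled = states.mean(dim=1)              # mean-over-time pooling
            return torch.sigmoid(self.out(pooled)).squeeze(-1)  # score in [0, 1]

    model = CnnBiLstmScorer(vocab_size=10000)
    scores = model(torch.randint(0, 10000, (4, 200)))  # 4 essays, 200 tokens each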
Alikaniotis and colleagues also analysed the Kaggle dataset and proposed a new score-
specific embedding approach in combination with a bidirectional LSTM, yielding a quadratic
weighted kappa of 0.96.
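Quadratic weighted kappa, the agreement metric reported in these studies, can be computed directly with scikit-learn; the rater labels below are invented for illustration:

    # Quadratic weighted kappa measures agreement between human and model
    # scores, penalising large disagreements more heavily than small ones.
    from sklearn.metrics import cohen_kappa_score

    human = [2, 3, 4, 4, 1, 5]
    model = [2, 3, 3, 4, 1, 4]
    print(cohen_kappa_score(human, model, weights="quadratic"))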
Further work, also using the ASAP (Automated Student Assessment Prize) dataset, was
conducted by Li et al. and Jin et al., who used two-stage approaches focusing on scoring texts
originating from different prompts. They integrated prompt-independent knowledge into
training in a first step and trained a neural net on more prompt-dependent features in a second
step, demonstrating the benefits of this approach for cross-prompt AES.

Methodological Background on Transformer Models

For the application of a transformer model, it is fundamental to understand the three
basic architecture types: encoder models, decoder models, and encoder–decoder models [7].
The types differ in the tasks they are predominantly used for and in the exemplary models
that have been implemented and trained on each of them.
In general, transformer models are neural networks based on the so-called attention
mechanism and were originally introduced in the context of language translation. The
attention mechanism was presented in 2014 by Bahdanau et al. [12] They showed that,
instead of encoding a text from the source language into a vector representation and then
decoding that vector representation into text of the target language, the attention mechanism
avoids this vector-representation bottleneck between the encoder and decoder by allowing the
model to search directly for relevant tokens in the source text when predicting the next token
of the target text.
While the encoder and decoder models previously used for translation tasks were mainly
based on RNNs, Vaswani et al. [13] showed that they can be implemented using an attention
mechanism alone. Before that, the attention mechanism had only been used to let translation
models consider the relevance of single words in the source language when predicting a
single word in the target language, which yields a more integral connection of the encoder
and decoder. To implement the encoder and decoder without RNNs, Vaswani et al.
implemented a self-attention mechanism for the texts of the source and target languages,
respectively, in which different attention layers tell the model to focus on informative words
and to neglect irrelevant words. They showed that, this way, the model achieves new
performance records on several translation benchmarks at a fraction of the training cost of the
best previously used models.
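The core computation can be sketched in a few lines of NumPy; the random weight matrices below stand in for learned parameters, and multi-head attention, masking, and positional encodings are omitted:

    # Minimal sketch of scaled dot-product self-attention: each token's output
    # is a weighted sum of all token values, with weights from query-key
    # similarity. Weight matrices here are random stand-ins for learned ones.
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                          # 5 tokens, model dim 8
    out = self_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))

Because every token attends to every other token in one matrix operation, the whole sequence is processed in parallel rather than step by step as in an RNN.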
A major advantage of the transformer architecture is that it allows the parallel processing
of the input data and is not affected by the vanishing gradient problem. This makes it possible
to train with larger and larger datasets, resulting in better and better language models. Today,
transformer-based architectures are applied to a large variety of tasks, not only in NLP but
also in fields such as computer vision, predictions based on time series or tabular data, or
multimodal tasks. In addition, in the majority of fields, transformer-based models are
currently leading the benchmarks. The additional capabilities of the model come at a price,
however. While models pre-trained on enormous datasets increase performance, the
computational processing of billions of parameters is costly and time-consuming. Moreover,
it restricts the training of such models to large corporations or governmental institutions and
could, therefore, prevent independent researchers or smaller organizations from gaining
equivalent access to such models.

The encoder–decoder architecture describes a sequence-to-sequence model, as proposed
in the original "Attention Is All You Need" paper [13]. This type of architecture is trained to
transform (in the case of an NLP task) a text sequence of one format into a text sequence of
another format, as in translation tasks.
The encoder architecture includes (as indicated by the name) only the input, or left-hand,
side of the original transformer architecture and transforms each input sequence into a
numerical representation. Typically, encoder models such as BERT, RoBERTa, or ALBERT
are used for text classification or extractive question answering. Since AES is a special case
of text classification, this type of model is particularly well suited to AES, and, accordingly,
the use of the BERT model has become increasingly prominent in the AES literature.
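As a hedged sketch of this usage, the Hugging Face transformers library allows an encoder model to be fine-tuned for AES as a regression task; the essay texts and labels below are placeholders, and the optimizer and training loop are omitted:

    # Sketch: fine-tuning BERT for AES as single-score regression.
    # num_labels=1 with problem_type="regression" yields an MSE loss head.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1, problem_type="regression")

    batch = tokenizer(["First placeholder essay text.",
                       "Second placeholder essay text."],
                      padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor([3.0, 4.5])          # human-rated scores (toy values)
    outputs = model(**batch, labels=labels)    # outputs.loss is the MSE loss
    outputs.loss.backward()                    # one step; optimizer omitted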
Finally, the decoder architecture includes only the output, or right-hand, side of the
original transformer model. In the original model, the decoder takes the information of the
input sequence from the encoder and generates the output stepwise, one token after another.
At each step, the decoder not only considers the information from the encoder, but also
combines it with the information given by the output tokens that have already been generated.
Using only the decoder architecture therefore results in a model that generates new tokens
based on the tokens it has already generated, or based on a sequence of initial tokens it was
provided with (typically called a "prompt"). Decoder models, such as GPT-3 or the
Transformer-XL, are therefore applied to generate text outputs, although, by providing
specifically designed prompts, such models can also be used for other tasks, such as
translation or classification.
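A toy sketch of such prompt-based use is shown below; the prompt wording is an assumption, and a small model like GPT-2 will not produce reliable scores, so the snippet only demonstrates the mechanics of prompting a decoder model:

    # Toy sketch of prompt-based scoring with a decoder-only model.
    # The prompt design is illustrative; GPT-2 is far too small to score well.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    prompt = ("Essay: Renewable energy reduces long-term costs because "
              "initial investments are offset by low running costs.\n"
              "Rate the essay above on a scale from 1 to 6.\nScore:")
    completion = generator(prompt, max_new_tokens=3)[0]["generated_text"]
    print(completion[len(prompt):])   # the tokens generated after the prompt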
All the models mentioned, across the different architecture types, have in common that
they are first trained self-supervised on large text corpora to calibrate powerful language
models. In a second step, they are fine-tuned to a specific task using supervised learning and
hand-coded labels (exceptions are models such as GPT-3, which have been successfully
applied to a variety of tasks without further fine-tuning, simply by designing appropriate
prompts). In this way, the language models can transfer the learned knowledge to any more
specific task (transfer learning). Thus, innovative approaches such as high-dimensional,
attention-based transformers moved into focus, as they process sequential data in parallel and
are highly context-sensitive. [10]

Summary of Literature Review

Automated Essay Scoring (AES) systems have evolved significantly from rule-based
models to advanced machine learning and deep learning frameworks. Initially, systems like
Project Essay Grade (PEG) relied on basic features such as word count, grammar, and
punctuation, and used regression analysis to predict scores; these models had limited ability
to evaluate deeper aspects like content and semantics and struggled with problems like an
overemphasis on document length.

The development of more sophisticated models, such as e-rater and Intelligent Essay
Assessor (IEA), introduced natural language processing (NLP) and Latent Semantic
Analysis (LSA), improving AES performance by analysing the semantic similarity of essays.
Despite these improvements, these models were still limited in handling complex essays and
unconventional writing styles.

The next significant leap in AES came with Recurrent Neural Networks (RNNs) and Long
Short-Term Memory (LSTM) networks. These models could handle sequential text and
understand context better. Researchers like Taghipour & Ng (2016) demonstrated that RNN-
based models outperformed traditional feature-based models in essay scoring. Other
advances, such as multi-aspect RNNs, considered several dimensions like content, grammar,
and style jointly, further improving AES accuracy.
Recent breakthroughs have been driven by Transformer-based architectures, such as
BERT and GPT. These models, built around the self-attention mechanism, enable AES
systems to process essays more contextually and to capture complex word relationships,
greatly enhancing the quality of scoring for long and intricate essays. For example, Zhao,
Hou, and Du (2020) showed how Transformer-based models outperform RNNs and LSTMs
in tasks involving complex sentence structures.

Identified Gaps in AES implementation and Practices


1. Lack of Domain Specialization: AES systems may struggle when scoring essays for
specialized sectors like the IAS (Indian Administrative Services) exams, where domain-
specific knowledge, analytical reasoning, and sector-specific issues are vital. General AES
models may fail to recognize the depth of insight required for specialized fields, leading to
unfair scores.
2. Inability to Process Handwritten Essays: Most AES systems rely on typed input and
cannot easily handle handwritten essays. Converting handwritten text into digital form using
OCR (optical character recognition) introduces additional challenges, such as
misinterpretation of handwriting styles, errors in text conversion, and difficulty in
recognizing structural nuances in the writing.
3. Lack of Adaptation to Varying Writing Styles: In competitive sectors like IAS,
candidates often employ sophisticated and varying writing styles. Generalized AES systems
may not account for these differences, leading to inaccuracies in scoring essays that require a
higher level of comprehension or unique rhetorical strategies.
4. Limited Evaluation of Subjective and Critical Thinking: Specialized exams like IAS
require evaluating essays on criteria such as critical thinking, depth of analysis, and balanced
arguments. Automated systems, particularly those not fine-tuned for the domain, struggle
with assessing complex thought processes or the quality of argumentation.

Proposed Model
This proposed Automated Essay Scoring (AES) system is designed specifically for the Indian
Administrative Services (IAS) examination, addressing the specialized needs of civil service
essay evaluations. By incorporating domain-specific knowledge, the system is able to assess
critical skills like argument structure, logical coherence, and critical thinking, which are
crucial for candidates aiming for roles in public administration and governance. The system
also supports Optical Character Recognition (OCR) technology to handle both typed and
handwritten essays, ensuring a fair and inclusive assessment process for candidates
submitting essays in different formats.

In addition to its advanced scoring capabilities, the system uses Natural Language Processing
(NLP) techniques such as BERT embeddings and machine learning models to capture deep
contextual understanding and assess content relevance. The model is aligned with IAS
grading rubrics to ensure consistency with human evaluators' standards, and it includes a bias
detection layer to prevent unfair scoring based on demographic factors like language
proficiency or socio-economic background. Overall, the system offers a sophisticated and
equitable solution for essay evaluation in civil services exams.
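Under these stated design choices, a minimal sketch of the OCR front end might look as follows; pytesseract is one plausible choice for recognition, and the file name and the downstream scorer call are placeholders, not components specified by the proposal:

    # Hedged sketch: convert a scanned handwritten essay to text with OCR
    # before passing it to the scoring model. The file name is a placeholder.
    from PIL import Image
    import pytesseract

    image = Image.open("handwritten_essay.png")        # scanned answer sheet
    essay_text = pytesseract.image_to_string(image)    # OCR: image -> raw text

    # The recognized text would then feed the BERT-based scorer described
    # above, e.g. score = scoring_model(essay_text).
    print(essay_text[:200])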
Framework for Model Formulation

The framework proceeds in eight phases:

01 Data Collection: essay collection and criterion selection happen in this phase.
02 Data Preprocessing: cleaning the data.
03 Feature Extraction: transforming the text into numerical features.
04 Model Selection: choosing an appropriate machine learning model.
05 Evaluation: assessing model performance and scoring.
06 Deployment: implementing the trained model in a web or mobile interface.
07 Fine Tuning: ensuring domain expertise in the scoring process.
08 Feedback Generation: using an LLM to provide personalised feedback.

This framework provides a structured approach to defining and solving complex problems:
it involves identifying decision variables, setting objective functions, and establishing
constraints to guide optimization or prediction. By providing clarity and structure, it also
facilitates sensitivity analysis and model validation, allowing adjustments and improvements
to be made as new data or insights emerge, and it helps simplify the problem, making it
easier to analyse and implement solutions using mathematical or computational techniques.

Figure: Raw architecture of the AES model.
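The eight phases above can be read as one callable pipeline; the following skeleton is purely illustrative, with stub functions (all names and bodies are assumptions) standing in for the components described in this review:

    # Illustrative skeleton mapping the eight framework phases to stubs.
    def collect_data():           return ["essay text one", "essay text two"]   # 01
    def preprocess(essays):       return [e.strip().lower() for e in essays]    # 02
    def extract_features(essays): return [[len(e.split())] for e in essays]     # 03
    def select_model():           return lambda fs: [min(5, f[0] // 3) for f in fs]  # 04
    def evaluate(preds):          return {"mean_score": sum(preds) / len(preds)}     # 05
    def deploy(model):            pass                       # 06: web/mobile interface
    def fine_tune(model):         return model               # 07: add domain expertise
    def generate_feedback(preds): return [f"Score {p}: add more detail." for p in preds]  # 08

    essays = preprocess(collect_data())          # 01 -> 02
    feats = extract_features(essays)             # 03
    model = select_model()                       # 04
    preds = model(feats)
    print(evaluate(preds))                       # 05
    deploy(model)                                # 06
    model = fine_tune(model)                     # 07
    print(generate_feedback(preds))              # 08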
References:
[1] E. B. Page, "Project Essay Grade," 1966. [Online]. Available: https://www.encyclopedia.com/arts/educational-magazines/page-ellis-batten-1924-2005 [Accessed: Sept. 22, 2024].

[2] J. Burstein, "The E-rater® Scoring Engine," ResearchGate, 1999. [Online]. Available: https://www.researchgate.net/publication/274309476_Attali_Y_Burstein_J_2006_Automated_essay_scoring_with_e-raterR_V2_Journal_of_Technology_Learning_and_Assessment_43 [Accessed: Sept. 22, 2024].

[3] T. K. Landauer, "The Intelligent Essay Assessor (IEA)," ResearchGate, 2003. [Online]. Available: https://www.researchgate.net/publication/256980328_The_Intelligent_Essay_Assessor [Accessed: Sept. 22, 2024].

[4] K. Taghipour and H. T. Ng, "A neural approach to automated essay scoring," ResearchGate, 2016. [Online]. Available: https://www.researchgate.net/publication/305748202_A_Neural_Approach_to_Automated_Essay_Scoring [Accessed: Sept. 22, 2024].

[5] F. Dong, Y. Zhang, and J. Yang, "Attention-based recurrent convolutional neural network for automatic essay scoring," ResearchGate, 2017. [Online]. Available: https://www.researchgate.net/publication/318739534_Attention-based_Recurrent_Convolutional_Neural_Network_for_Automatic_Essay_Scoring [Accessed: Sept. 22, 2024].

[6] S. Ludwig, C. Mayer, C. Hansen, K. Eilers, and S. Brandt, "Automated essay scoring using transformer models," arXiv preprint arXiv:2110.06874, 2021. [Online]. Available: https://arxiv.org/pdf/2110.06874 [Accessed: Sept. 22, 2024].

[7] S. Ludwig, C. Mayer, C. Hansen, K. Eilers, and S. Brandt, "Automated essay scoring using transformer models," Psych, vol. 3, no. 4, pp. 897-915, 2021. [Online]. Available: https://www.mdpi.com/2624-8611/3/4/897 [Accessed: Sept. 22, 2024].

[8] K. Yang, M. Rakovic, Y. Li, Q. Guan, D. Gašević, and G. Chen, "Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability," arXiv preprint, 2024. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2401.05655 [Accessed: Sept. 22, 2024].

[9] Y. Ding, M. Bexte, and A. Horbach, "A multi-task learning study on automatic scoring of argumentative essays," in Findings of the Association for Computational Linguistics: ACL 2023. [Online]. Available: https://aclanthology.org/2023.findings-acl.825.pdf [Accessed: Sept. 22, 2024].

[10] C. M. Ormerod, A. Malhotra, and A. Jafari, "Automated essay scoring using efficient transformer-based language models," arXiv preprint, 2021. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2102.13136 [Accessed: Sept. 22, 2024].

[11] C. Xiao, W. Ma, S. X. Xu, K. Zhang, Y. Wang, and Q. Fu, "From automation to augmentation: Large language models elevating essay scoring," arXiv preprint, 2024. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2401.06431v2 [Accessed: Sept. 22, 2024].

[12] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014. [Online]. Available: https://arxiv.org/abs/1409.0473 [Accessed: Sept. 22, 2024].

[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762 [Accessed: Sept. 22, 2024].
